Project Goal


Address  the  fundamental  problem  of  modeling  and  solving  “communities”  of  tasks  from 
a  cognitive  point  of  view  through  multiple  problem  solving  agents  working  cooperatively 
or  competitively  on  different  subtasks  at  multiple  levels  of  granularity. 

•  Agents  are  naturally  grouped  into  hierarchies  or  communities,  and  such  groupings 
may  occur  dynamically 

•  Cognition  model  is  based  on  a  strong  global  coordination  mechanism  that  relies 
on  “focus”  in  order  to  elevate  a  low-power  agent  into  a  full-scale  “thinking”  agent 
o  Dynamic  redistribution  of  “brain  thinking  power” 


Project  Members 

•  Dr.  Eugene  Santos  Jr.  &  Kiley  McEvoy  -  Dartmouth  College 

o  Contributions 

■  Model  architecture  definition 

■ Small-scale implementation and testing

•  Dr.  Nael  Abu-Ghazaleh  &  Vinay  Kolar  -  SUNY  Binghamton 

o  Contributions 

■  Multi-agent  development 

■  Component  communication  development 

■  Large  scale  deployment 

•  Dr.  Mark  Zhang  &  Zhen  Guo  -  SUNY  Binghamton 

o  Contributions 

■  Community  Generation  Theory  development 

■  Task  relationship  identification 


Project  Summary 

In  its  grandest  sense,  Project  CASIE  explored  the  development  of  a  computational  system 
capable  of  high  level  perception  and  problem  solving  that  reflects  the  cognitive  processes 
of the human brain. More specifically, we concentrated on better understanding and
modeling  intuition  and  insight  in  a  computational  fashion. 

The  human  brain  utilizes  a  wide  variety  of  methods  in  order  to  comprehend  and  solve  the 
various  problems  faced  on  a  regular  basis.  Much  research  has  investigated  the  use  of 
individual  mechanisms  in  single-domain  puzzle-type  problems,  but  relatively  little  work 
has  explored  the  dynamic  use  of  multiple  methods  that  is  required  in  most  real  world 
applications.  Advanced  abilities  such  as  insight  and  creativity  are  inherently  used  to 
solve  multi-domain  problems.  Despite  the  ubiquity  of  these  activities,  their  inherent 
mystique  and  spontaneity  render  their  characterization  difficult  through  conventional 
methods.  This  work  serves  to  explore  various  levels  of  problem  solving  as  a  result  of  the 
dynamic  utilization  of  a  coordinated  set  of  specialized  mechanisms.  It  is  hypothesized 
that  the  ability  of  the  mind  to  dynamically  handle  complex  problems  is  dependent  on  the 
elegant structure of memory, an overseeing control, and ubiquitous events such as mind-wandering
that occur during thought. To demonstrate these theories, a cognitive
architecture  has  been  designed  through  the  conflation  and  further  development  of  current 
problem  solving  theories  from  various  research  communities.  The  developed  cognitive 
architecture  has  been  implemented  in  a  computational  environment  for  testing  using  the 
real  world  application  of  medical  diagnosis.  Experimental  results  demonstrate  how  the 
coordination  of  various  types  of  thought,  including  mind-wandering,  can  contribute  to 
higher-level  problem  solving  events  such  as  insight.  It  was  the  ultimate  goal  of  this  work 
to  provide  a  strong  foundation  for  future  research  in  holistic  cognitive  architectures  and 
high-level  problem  solving. 

While existing problem solving models are successful, they fail to reflect some of the inherent mechanisms of the
brain  that  may  be  essential  to  real  world  problem  solving.  Many  existing  models  view 
solving  as  a  goal  driven,  top  down  process  that  is  able  to  work  through  problems  with 
efficiency and accuracy. They do not include ubiquitous events, such as mind wandering
and attention to external stimuli, that occur during problem solving. This is due to the
notion that such disturbances contain task-unrelated thought [1]. While this may be true,
various  accounts  of  insight  have  shown  that  complex  ideas  and  solutions  can  result  from 
mind  wandering  or  in  response  to  external  stimuli  [2].  In  order  for  this  to  occur,  the  mind 
must  be  able  to  dynamically  divert  attention  to  thoughts  that  may  be  relevant  to  either 
active  or  dormant  problems.  This  ability  would  require  several  specialized  processes 
operating  simultaneously.  It  is  our  hypothesis  that  productive  human  cognition  is  the 
result  of  the  cooperation  between  multiple  parallel  functions,  governed  by  a  global 
coordination  mechanism.  This  hypothesis  will  be  developed  through  the  explanation  of 
theories  for  the  coordinated  use  of  various  types  of  thought  as  well  as  their  proposed 
involvement  in  higher-level  forms  of  problem  solving. 

The  term  task-unrelated  thought  is  generally  used  to  describe  brain  activity  not  associated 
with the current goal, for example a daydream [3]. However, various studies have
demonstrated the functional similarities between task-related and task-unrelated thought
[4]. Neuroimaging findings show that the patterns of neuron activation during wandering
thought strongly overlap those observed during active problem solving. Findings also
demonstrate that these two types of thought are inversely related to one another: as task
demand increases, the evidence of spontaneous unrelated thought decreases [5]. These
studies suggest that task-related and task-unrelated thought compete for control of
common resources in order to perform their function. Our interpretation, however, is the
opposite. We believe that the common resources actually make up a mechanism able to
coordinate both types of thought. Furthermore, we feel that the two types of thought are
not  independent  and  are  both  utilized  during  problem  solving.  Thus,  we  reject  the 
notions  of  related  and  unrelated  and  refer  to  them  as  rational  and  intuitive  thought. 

The  ability  to  direct  one’s  thought  in  a  goal  oriented  manner  is  what  allows  us  to 
productively  interact  with  our  surroundings.  Using  logic  and  reason,  one  is  able  to  make 
decisions,  infer  relationships,  and  manipulate  thought.  Naturally,  these  abilities  play  a 
large  role  in  problem  solving.  Upon  encountering  a  problem,  one  must  develop  an 
understanding  of  the  situation  and  properly  select  and  execute  an  appropriate  strategy.  In 
common language, the term rational is associated with one’s behavior rather than the
underlying thought. For this project, it will be used to describe a deliberate thought or
action that is “consistent with or based on reason” [6].

Intuition  is  defined  as  “the  act  or  faculty  of  knowing  or  sensing  without  the  use  of 
rational processes” [6]. Though the term is commonly associated with the spontaneous
appearance of thoughts relevant to an active problem, we will also include the recollection of
seemingly  irrelevant  or  irrational  thought.  When  not  involved  with  a  computationally 
intensive  task,  one  may  find  themselves  humming  a  random  song  or  suddenly  recalling  a 
childhood  memory.  If  asked,  the  source  or  reason  for  such  thoughts  cannot  be  explained. 
In  some  instances,  the  unrequested  thoughts  consume  all  of  one’s  consciousness  and  can 
be  focused  on.  Other  times,  they  seem  to  occur  in  the  background  of  one’s  mind,  barely 
perceivable.  This  feeling  is  often  experienced  immediately  prior  to  recalling  a  necessary 
bit  of  information,  such  as  a  word  to  describe  a  situation.  One  may  have  a  strong  feeling 
of awareness for a word matching the scenario, but cannot immediately verbalize it.

It  is  within  our  hypothesis  that  both  problem-relevant  and  problem-irrelevant  intuitions 
occur  through  the  same  mechanism.  We  suggest  that  intuitive  thought  occurs  due  to 
associations  between  concepts  on  a  neurological  level.  One  can  agree  that  a  particular 
stimulus  such  as  the  smell  of  the  ocean  is  capable  of  eliciting  memories  of  past 
experiences  involving  a  beach.  We  will  subscribe  to  the  well-supported  belief  that  this  is 
due  to  memories  existing  as  overlapping  neural  networks  within  the  brain  that  host  our 
experiences  within  their  connected  structure  [7].  Based  on  the  principle  of  synchronous 
convergence,  networks  that  are  active  simultaneously  will  form  a  connection  and  later 
will  be  capable  of  activating  one  another.  In  other  words,  concepts  existing  in 
consciousness  together  will  be  encoded  into  memory  with  an  association.  Thus,  one  is 
prone  to  develop  an  association  with  the  smell  of  the  ocean  and  the  visual  representation 
of  the  beach. 

As  rational  and  intuitive  thought  differ  greatly  in  behavior,  the  mind  would  be  very 
limited  if  only  one  existed.  The  abilities  of  both  mechanisms  must  be  coordinated  in 
order for the brain to productively interact with a dynamic environment. We posit that
this coordination is managed by an overseeing cognitive mechanism, which will be
referred to as meta-cognition. The term has been used by many as a buzzword in
experimental education and psychology to describe the ability to stay on task. For the
purposes of this project, a more specific definition for meta-cognition will be used,
identifying its abilities as “control of learning, planning and selecting strategies,
monitoring the progress of learning, correcting errors, analyzing the effectiveness of
learning strategies, and changing learning behaviors and strategies when necessary” [8].
These abilities will be extended to include control of multiple tasks and monitoring of
semi-conscious thought. To describe the interaction of meta-cognition, we propose a spectrum of
coordination spanning proportional levels of rational and intuitive thought.

A  state  consisting  primarily  of  rational  thought  is  usually  entered  following  the  discovery 
of  a  relevant  procedure  that  can  now  be  applied.  For  example,  in  working  on  a  long 
division  problem,  the  method  is  known  and  solving  the  problem  is  simply  a  matter  of 
computation. In this state, meta-cognition has allocated the majority of its attention
away from intuitive thought. Studies have shown that during intense task-related thought,
there  is  an  absence  of  activity  in  brain  regions  associated  with  monitoring  of  sensory 
information  [9].  Thus,  the  brain  is  less  subject  to  distraction  from  external  stimuli. 

When  engaged  in  primarily  intuitive  thought,  meta-cognition  allows  thought  to  flow 
freely  in  response  to  concept  activations  from  both  external  and  internal  sources.  This 
end of the spectrum is representative of “day-dreaming” or “mind-wandering”. For
example,  one  might  be  reading  about  insects  and  begin  to  think  about  beetles,  followed 
by  a  daydream  of  playing  on  stage  with  John  Lennon.  Though  this  type  of  thought  may 
not  be  working  on  a  particular  task,  brain  regions  typically  associated  with  problem 
solving  are  occasionally  recruited  during  intuitive  thought  [4].  It  is  believed  this  is  to 
evaluate  and  retrieve  factual  information  as  needed  in  day-dreaming.  This  type  of  free 
flowing  thought  often  leads  into  task  oriented  thought,  particularly  when  the  mind 
encounters  a  subject  of  interest. 

Collaborative  use  of  rational  and  intuitive  thought  occurs  when  the  mind  is  working 
towards a complex goal, that is, one that requires the solving of more than one sub-goal.
When a problem is encountered, rational thought is used in an attempt to expand and build
the problem through logic and reasoning. As rational thought traverses memory,
networks will be activated, based on the principle of synchronous convergence. These
activations will potentially trigger intuitive thoughts. The extent of intuition depends
on the pace of rational thought. If rational thought moves quickly through the mind,
such  as  in  solving  a  familiar  task,  there  is  less  chance  for  distantly  connected  networks  to 
become  activated,  limiting  abstractly  related  solutions.  Conversely,  if  rational  thought  is 
working  slowly,  such  as  when  one  is  unsure  how  to  solve  a  problem,  distantly  related 
networks  have  a  greater  chance  of  being  activated  through  intuition.  As  intuitive  thought 
occurs,  it  is  observed  by  meta-cognition.  If  the  activated  concept  is  thought  to  be  of 
interest,  meta-cognition  will  redirect  the  global  focus  to  the  newly  activated  idea. 

There  are  several  conflicting  views  as  to  what  types  of  problem  solving  can  be  classified 
as  insight.  Some  feel  that  insight  includes  suddenly  solving  a  puzzle-type  problem  while 
actively  attempting  it  [10].  Within  our  hypothesis,  this  is  merely  a  moment  of  complex 
intuition  preceded  by  a  restructuring  of  the  problem.  In  our  opinion,  the  fascination  with 
insight  is  in  the  ability  to  unintentionally  realize  the  relation  of  a  current  situation  to  an 
inactive  problem  residing  in  a  nearly  infinite  memory.  Thus,  we  will  define  insight  as  the 
inadvertent  realization  of  the  applicability  of  an  idea  or  situation  to  a  previously  unrelated 
problem  that  results  in  a  novel  and  productive  integration  of  the  two. 

We  believe  that  insight  is  heavily  dependent  on  mind  wandering  and  the  global 
awareness  of  one’s  meta-cognition.  While  working  on  a  problem,  one  develops 
associations  and  factual  links  between  concepts  in  memory,  allowing  the  problem  to  be 
recalled  later  in  the  same  manner.  Meta-cognition  becomes  aware  of  these  problems 
knowing  they  are  of  global  interest.  Thus,  if  any  relation  becomes  active  through 
intuitive  thought,  meta-cognition  immediately  diverts  attention  to  mapping  the  activation 
to  the  problem.  Sometimes  this  recollection  might  occur  following  strong  activation  of  a 
network directly related to the problem, as suggested in the Opportunistic
Assimilation  hypothesis.  However,  we  suggest  that  activations  can  occur  based  on  more 
abstract  relations,  particularly  due  to  the  overlapping  of  neural  networks.  For  example, 
for  Archimedes,  the  conceptualization  of  his  body  causing  the  water  to  overflow  may 
have  partially  overlapped  the  existing  network  containing  his  problem.  Through  this 
activation,  intuitive  thought  could  build  a  perceivable  portion  of  the  network,  allowing 
meta-cognition  to  initiate  mapping. 

In  summary,  our  hypothesis  states  that  all  levels  of  problem  solving  occur  through  the 
dynamic  use  of  a  set  of  mechanisms  whose  functions  are  coordinated  by  a  meta-cognitive 
component. Rational thought serves to perform cognitive tasks utilizing factual
information  stored  in  memory.  Traversal  of  memory  networks  during  such  tasks 
activates  related  networks  causing  intuitive  thought.  These  autonomic  activations  may  or 
may  not  be  perceived  depending  on  the  current  focus  of  meta-cognition.  When  working 
on a computationally intensive task, meta-cognition will focus on management of rational
thought  and  suppression  of  disruptive  thought.  In  periods  of  rest,  intuitive  thought  is 
unrestricted  and  “mind-wandering”  may  occur.  In  solving  novel  problems,  both  types  of 
thought  are  used  to  develop  the  problem  and  discover  relevant  concepts.  Occasionally,  a 
unique  traversal  path  through  memory  may  simultaneously  activate  two  or  more 
previously  unrelated  networks.  Insight  is  considered  to  occur  if  such  activation  results  in 
a  beneficial  integration  of  the  networks. 

We  will  now  describe  our  implementation  and  testing  of  the  CASIE  Cognitive 
Architecture. 


CASIE  Cognitive  Architecture 

In  order  to  demonstrate  and  test  the  discussed  hypothesis  for  coordinated  function,  our 
theoretical  mechanisms  were  integrated  into  a  cognitive  architecture.  The  CASIE 
architecture  is  composed  of  theoretical  mechanisms  able  to  process  data  from  a  user  and 
cooperatively solve a range of real-world type problems. Medical diagnosis was
chosen  as  the  testing  domain  due  to  its  wealth  of  information  and  manageable  data 
structure.  We  now  explain  the  logic  behind  the  expression  of  our  theories  through  a  set 
of  architecture  components. 

Medical  diagnosis  was  selected  as  a  testbed.  Selecting  a  domain  in  which  to  test  CASIE 
was  inherently  difficult.  One  can  design  a  theoretical  architecture  capable  of  handling  all 
types  of  information  understandable  by  humans.  Yet,  from  a  computational  standpoint, 
this  completeness  would  be  overly  ambitious.  Thus,  the  selected  testbed  had  to  be 
complex  enough  to  be  representative  of  real  world  problem  solving  scenarios  but  also 
remain  adoptable  by  a  computational  environment.  Four  criteria  were  specified  to  meet 
these goals. The main attraction of medical diagnosis was that, despite the nearly
infinite domain, the information used in problem solving could be managed and was readily
available.

The CASIE architecture consists of six components operating in parallel to collectively
complete  tasks  through  various  methods.  These  components  serve  to  manage  the 
cooperation  of  uncontrolled  and  controlled  processes  involved  in  problem  finding  and 
solving.  Attention  is  dynamically  allocated  depending  on  the  architecture’s  state.  A 
visual  representation  of  the  architecture  can  be  seen  in  Figure  1. 


Figure  1:  CASIE  components  and  available  communication  channels 

Each  component  has  a  specific  role  in  the  architecture  and  its  behavior  is  dependent  on 
the  state  of  the  rest  of  the  system.  The  six  components,  Information,  Reasoning, 
Connection,  Regulatory,  Focus,  and  Frontier,  will  be  briefly  explained  through 
descriptions  of  basic  function  and  detailed  interaction  examples. 

The Information component is representative of one’s long term memory. It has been
designed  to  host  encountered  problems,  related  information,  solution  procedures,  and 
methods  to  validate  potential  solutions.  For  the  domain  of  medical  diagnosis,  these  data 
types  have  been  specified  into  patients,  symptoms,  diagnoses,  and  tests.  Patients  serve  as 
access  points  to  problems.  When  a  doctor  is  presented  with  a  diagnosis  case,  related 
information  is  gathered.  This  information  primarily  includes  the  patient’s  symptoms, 
which  are  then  used  to  find  potential  diagnoses.  A  doctor  may  then  validate  their  beliefs 
or gain new information through the use of medical tests.

The  data  stored  in  Information  is  traversed  and  utilized  based  on  relationships  between  its 
entities.  These  relationships  are  classified  as  either  factual  or  associative  based  on  their 
method of creation. As implied, factual relationships represent information one feels is
definitively true. For ease of implementation, and to keep the architecture
expandable to other domains, factual links are represented through four phrases:
“has”, “is”, “can cause”, and “tests for.” The use of these phrases is outlined in Table 1.

Table 1: Factual relationship phrases

Phrase      Signifies                  Used to relate           Example
has         Possession                 Patients to Diagnoses    John ‘has’ Fatigue
                                       Patients to Symptoms
is          Hierarchical / Synonymous  Symptoms to Symptoms     Tiredness ‘is’ Fatigue
                                       Diagnoses to Diagnoses   Lung Cancer ‘is’ Cancer
can cause   Cause & Effect             Diagnoses to Symptoms    Flu ‘can cause’ Fever
tests for   Solution validation        Tests to Diagnoses       MRI ‘tests for’ Tumor
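
As a minimal sketch of how the four factual phrases of Table 1 might be stored, the following Python fragment keeps each relationship as a (subject, phrase, object) triple that can be looked up from either end. The class name FactualMemory and its methods are hypothetical and chosen only for illustration; they are not taken from the CASIE implementation, which was built on Cougaar.

```python
from collections import defaultdict

# The four factual phrases listed in Table 1.
FACTUAL_PHRASES = {"has", "is", "can cause", "tests for"}


class FactualMemory:
    """Hypothetical store of (subject, phrase, object) factual triples."""

    def __init__(self):
        self._forward = defaultdict(set)   # (subject, phrase) -> objects
        self._backward = defaultdict(set)  # (object, phrase)  -> subjects

    def add(self, subject, phrase, obj):
        # Reject anything that is not one of the four factual phrases.
        if phrase not in FACTUAL_PHRASES:
            raise ValueError(f"unknown factual phrase: {phrase!r}")
        self._forward[(subject, phrase)].add(obj)
        self._backward[(obj, phrase)].add(subject)

    def objects(self, subject, phrase):
        """Everything that `subject` is linked to through `phrase`."""
        return set(self._forward.get((subject, phrase), ()))

    def subjects(self, obj, phrase):
        """Everything linked to `obj` through `phrase`."""
        return set(self._backward.get((obj, phrase), ()))


# The example rows of Table 1.
memory = FactualMemory()
memory.add("John", "has", "Fatigue")
memory.add("Tiredness", "is", "Fatigue")
memory.add("Lung Cancer", "is", "Cancer")
memory.add("Flu", "can cause", "Fever")
memory.add("MRI", "tests for", "Tumor")

symptoms_of_john = memory.objects("John", "has")
tests_for_tumor = memory.subjects("Tumor", "tests for")
```

Indexing each triple in both directions is a design choice of this sketch: it lets a traversal move from a patient to symptoms as easily as from a diagnosis back to the tests that validate it.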

The  second  type  of  relationship  in  Information  serves  to  represent  links  within  data 
created  through  environment  interaction  and  processing.  This  type  of  relation  utilizes  the 
word  “association”  to  signify  a  relationship  between  entities.  Within  CASIE,  associative 
relationships  are  created  between  simultaneously  active  entities,  based  upon  the 
aforementioned  principle  of  synchronous  convergence.  A  diagram  of  the  CASIE 
memory  structure  can  be  seen  in  Figure  2. 


Figure 2: CASIE memory structure
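
The creation of associative relationships through synchronous convergence can be sketched as follows. The sketch assumes that an associative link is simply a count of how often two entities were active at the same time; the class AssociativeMemory, its method names, and the use of a count as link strength are assumptions made for illustration and are not details of the CASIE implementation.

```python
from collections import Counter
from itertools import combinations


class AssociativeMemory:
    """Hypothetical store of undirected association links."""

    def __init__(self):
        # frozenset({a, b}) -> number of times a and b were co-active
        self.strength = Counter()

    def co_activate(self, active_entities):
        """Link every pair of entities that are active at the same time."""
        for a, b in combinations(sorted(set(active_entities)), 2):
            self.strength[frozenset((a, b))] += 1

    def associated(self, entity):
        """All entities that share at least one association with `entity`."""
        result = set()
        for pair in self.strength:
            if entity in pair:
                result |= pair - {entity}
        return result


assoc = AssociativeMemory()
# The ocean and beach example from the hypothesis discussion.
assoc.co_activate(["Smell of Ocean", "Beach"])
# "Sunburn" is an invented third entity, used only to show a second episode.
assoc.co_activate(["Smell of Ocean", "Beach", "Sunburn"])

linked_to_beach = assoc.associated("Beach")
ocean_beach_strength = assoc.strength[frozenset(("Smell of Ocean", "Beach"))]
```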


The  Reasoning  component  is  responsible  for  carrying  out  tasks  associated  with  rational 
thought.  As  a  task  is  worked  on,  Reasoning  temporarily  hosts  active  task  knowledge 
within  what  will  be  referred  to  as  the  whiteboard.  Based  on  contents  of  the  whiteboard, 
Reasoning  selects  appropriate  procedures  to  advance  towards  a  goal.  These  procedures 
can  include  recalling  related  data  from  Information,  dividing  tasks  into  subtasks,  making 
decisions,  and  requesting  activity  from  other  components.  To  perform  these  procedures, 
Reasoning utilizes factual relationships between knowledge in Information. Reasoning is
also responsible for the direct manipulation and addition of such knowledge.

The processes of the Connection component are representative of intuitive thought.
Based  on  the  contents  of  the  whiteboard  within  Reasoning,  Connection  continuously 
retrieves  associated  knowledge  from  Information.  If  the  system  is  moving  rapidly  from 
task  to  task,  Connection  is  only  capable  of  finding  associations  immediately  related  to  the 
task’s  domain.  However,  when  a  task  cannot  be  solved  or  one  is  not  active,  Connection 
is  able  to  seek  deeper  associations. 
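
One way to picture the difference between immediate and deeper associations is a breadth-first walk over association links with a depth limit. This is a sketch under our own assumptions: the function reachable_associations and its max_depth parameter are hypothetical, and the link data reuses the daydream example from the hypothesis discussion, with ‘The Beatles’ added as an assumed intermediate step.

```python
from collections import deque


def reachable_associations(links, start, max_depth):
    """Return entities reachable from `start` within `max_depth` association hops."""
    seen = {start}
    found = []
    queue = deque([(start, 0)])
    while queue:
        entity, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not look beyond the allowed depth
        for neighbor in sorted(links.get(entity, ())):
            if neighbor not in seen:
                seen.add(neighbor)
                found.append(neighbor)
                queue.append((neighbor, depth + 1))
    return found


# Association links (entity -> directly associated entities).
links = {
    "Insects": {"Beetles"},
    "Beetles": {"The Beatles"},
    "The Beatles": {"John Lennon"},
}

# A system moving rapidly between tasks only reaches immediate associations.
shallow = reachable_associations(links, "Insects", max_depth=1)
# An idle or stalled system is allowed to follow the chain further.
deep = reachable_associations(links, "Insects", max_depth=3)
```

In this picture, the pace of Reasoning would set max_depth: a small value while tasks are being solved quickly, and a larger one at an impasse or during inactivity.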

The  Frontier  component  is  responsible  for  managing  CASIE’s  interaction  with  the 
outside  world.  When  new  information  enters  the  system,  Frontier  makes  it  available  to 
the other components. Similarly, any information that must be expressed externally is
presented through Frontier. In this implementation, Frontier has been developed to
handle textual information. However, more advanced versions of the component could
be  representative  of  a  more  complete  sensory  system,  including  auditory  and  visual 
processing  as  well  as  mechanical  action. 

The  Focus  component  serves  to  maintain  a  list  of  active  and  inactive  problems.  While 
working  on  a  task  that  involves  multiple  subtasks,  Focus  hosts  the  list  of  jobs  that  must 
be done to reach the overall goal. Additionally, if tasks are interrupted or CASIE reaches
an  impasse,  tasks  are  stored  in  Focus  as  dormant  tasks  to  be  attempted  later. 
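
A minimal sketch of the two lists held by Focus is given below. The class and method names are hypothetical, and the first-in, first-out ordering of both lists is an assumption of the sketch, since no ordering policy is prescribed here.

```python
from collections import deque


class Focus:
    """Hypothetical holder of pending subtasks and dormant tasks."""

    def __init__(self):
        self.jobs = deque()  # subtasks still to be done for the current goal
        self.dormant = []    # interrupted tasks and tasks that hit an impasse

    def push_subtasks(self, subtasks):
        self.jobs.extend(subtasks)

    def next_job(self):
        """Next pending subtask, or None when the job list is empty."""
        return self.jobs.popleft() if self.jobs else None

    def shelve(self, task, reason):
        """Store a task as dormant so it can be attempted again later."""
        self.dormant.append({"task": task, "reason": reason})

    def recall_dormant(self):
        """Oldest dormant task, or None when nothing is dormant."""
        return self.dormant.pop(0)["task"] if self.dormant else None


focus = Focus()
focus.push_subtasks(["expand symptoms", "find diagnoses"])
first = focus.next_job()
focus.shelve("diagnose Sandy", reason="impasse")
```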

The  Regulatory  component  serves  to  manage  the  behavior  of  CASIE  from  a  global 
perspective.  Its  function  is  analogous  to  the  concept  of  meta-cognition.  When  new  tasks 
are  encountered,  Regulatory  determines  whether  or  not  the  incoming  task  should  interrupt 
the  current  activity  of  the  system.  When  engaged  in  a  task,  Regulatory  monitors  the 
system  activity  to  ensure  that  all  components  are  working  towards  the  global  goal. 
Additionally,  if  any  localized  activity  seems  to  relate  to  either  the  task  at  hand  or  a 
dormant task in Focus, Regulatory will shift attention to investigate the use of that
thought.  During  times  of  inactivity,  Regulatory  recalls  unsolved  tasks  from  Focus  to  be 
re-attempted. 

To achieve problem solving ability, CASIE’s components work cooperatively. The
various interactions are outlined in Table 2 and their use is detailed through the examples
following.

Table 2: Component Interactions

Components                Interaction
Frontier & Regulatory     As commands and information enter the system, Frontier
                          passes them to Regulatory to handle. Regulatory also
                          reports system status to Frontier.
Frontier & Reasoning      While learning, information is sent from Frontier to
                          Reasoning for storage. External actions are requested
                          by Reasoning through Frontier.
Regulatory & Reasoning    Regulatory sends commands to Reasoning and observes
                          its activity.
Focus & Reasoning         When tasks are interrupted, or an impasse is reached,
                          Reasoning sends tasks to Focus. Reasoning also passes
                          sub-tasks to Focus.
Focus & Regulatory        Regulatory uses the dormant task list within Focus to
                          determine activity during periods of inactivity.
Reasoning & Information   Reasoning recalls knowledge from Information using
                          factual relationships. During learning or inference,
                          Reasoning stores knowledge in Information.
Connection & Information  Connection recalls knowledge from Information using
                          association relationships.
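
The component pairs of Table 2 can be written down as a small set of undirected channels. This sketch encodes only the seven pairs listed in the table; the names CHANNELS and can_communicate are hypothetical, and the actual message passing in the Cougaar implementation is not represented.

```python
# Each channel is an unordered pair of component names taken from Table 2.
CHANNELS = {
    frozenset(("Frontier", "Regulatory")),
    frozenset(("Frontier", "Reasoning")),
    frozenset(("Regulatory", "Reasoning")),
    frozenset(("Focus", "Reasoning")),
    frozenset(("Focus", "Regulatory")),
    frozenset(("Reasoning", "Information")),
    frozenset(("Connection", "Information")),
}


def can_communicate(a, b):
    """True if Table 2 lists a direct interaction between components a and b."""
    return frozenset((a, b)) in CHANNELS


reasoning_to_frontier = can_communicate("Reasoning", "Frontier")
frontier_to_information = can_communicate("Frontier", "Information")
```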

In  CASIE,  insight  moments  occur  when  Regulatory  realizes  the  application  of  current 
information  to  the  solution  of  a  dormant  problem  residing  in  Focus.  Upon  reaching  an 
impasse, the task as well as all of its failed subtasks are stored in Focus. Additionally,
associations between the symptom set and the patient are created, based on the principle
of synchronous convergence. If, during subsequent thought, the association becomes
active  in  Connection,  Regulatory  interrupts  the  system  and  commands  Reasoning  to 
attempt  to  apply  the  newly  learned  information.  For  example,  when  diagnosing  the 
patient  ‘Sandy’,  shown  in  Figure  3a,  an  impasse  is  reached  after  expanding  the  two 
symptoms  as  much  as  possible.  The  case  is  stored  in  Focus  as  an  unsolved  task  and  an 
association  is  created  between  ‘Skin  Symptom  +  Neurological  Symptom’  and  ‘Sandy’. 
When  diagnosing  the  patient  ‘Vinny’,  shown  in  Figure  3b,  Reasoning  begins  to  expand 
the symptom set. Ordinarily, these expansions would be disregarded and most likely
never perceived, as Connection discovers an association with the diagnosis
‘Neurofibromatosis’. However, as ‘Mass in Spinal Cord’ and ‘Tan Skin Patches’ expand
to  ‘Neurological  Symptom’  and  ‘Skin  Symptom’  respectively,  Connection  also  discovers 
the  association  to  the  unsolved  case  of  ‘Sandy’.  Upon  this  realization,  Regulatory 
instructs  Reasoning  to  attempt  to  map  the  analog  case  to  the  base.  Reasoning  determines 
that  Vinny’s  diagnosis  is  capable  of  causing  Sandy’s  symptoms  and  the  diagnosis  is 
validated  through  a  test,  shown  in  Figure  3c. 
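
The recall step of this example can be sketched in a few lines. The sketch makes several assumptions that should be read as such: Sandy’s two symptoms are taken to be ‘Skin Tumors’ and ‘Tingling in Leg’ as read from Figure 3a, the ‘is’ expansions are written as a plain dictionary, and the class InsightSketch with its methods is hypothetical. It shows only how an expanded symptom set can bring a dormant patient back to attention; the subsequent mapping and test validation performed by Reasoning are not modeled.

```python
# Assumed 'is' expansions from specific symptoms to general categories.
IS_A = {
    "Skin Tumors": "Skin Symptom",
    "Tingling in Leg": "Neurological Symptom",
    "Tan Skin Patches": "Skin Symptom",
    "Mass in Spinal Cord": "Neurological Symptom",
}


class InsightSketch:
    """Hypothetical recall of a dormant case through a shared symptom set."""

    def __init__(self, is_a):
        self.is_a = is_a
        self.dormant = {}  # expanded symptom set -> patient left at an impasse

    def expand(self, symptoms):
        # Replace each symptom by its general category where one is known.
        return frozenset(self.is_a.get(s, s) for s in symptoms)

    def impasse(self, patient, symptoms):
        # Store the unsolved case and associate its expanded set with the patient.
        self.dormant[self.expand(symptoms)] = patient

    def work_on(self, patient, symptoms):
        # Return a different, dormant patient recalled by association, if any.
        recalled = self.dormant.get(self.expand(symptoms))
        if recalled is not None and recalled != patient:
            return recalled
        return None


sketch = InsightSketch(IS_A)
sketch.impasse("Sandy", ["Skin Tumors", "Tingling in Leg"])
recalled = sketch.work_on("Vinny", ["Tan Skin Patches", "Mass in Spinal Cord"])
```

Here the value of recalled plays the role of the signal on which Regulatory would interrupt the system and ask Reasoning to revisit the dormant case.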


[Figure 3, panel A: Sandy ‘has’ Skin Tumors; Sandy ‘has’ Tingling in Leg]

Figure 3: A. First diagnosis leads to impasse, as no associations exist. B.
Expansion of another patient’s symptoms leads to association with
‘Neurofibromatosis’. C. Following insight moment, ‘Neurofibromatosis’ is
validated as a potential diagnosis

Implementation  of  the  CASIE  architecture  into  a  computer  environment  occurred  as  a 
collaborative  effort  between  two  teams.  The  majority  of  the  infrastructure  and 
component  communication  was  developed  by  the  team  at  SUNY  Binghamton,  while  the 
cognitive algorithms were developed by the Dartmouth team. Cougaar, a Java software
architecture that allows for building distributed agent-based applications, was selected as
the platform for CASIE’s development. Computer implementation allowed for
demonstration  of  the  aforementioned  theories  through  experiments  using  the  medical 
diagnosis  testbed. 

Synopsis  of  Experiments 

The  main  theme  in  our  hypothesis  is  that  insight  occurs  as  a  result  of  the  monitoring  of 
autonomic  intuitive  thought  through  meta-cognition.  To  demonstrate  this  theory,  several 
tests were conducted in an attempt to trigger an insightful moment within CASIE. Tests
consisted of a target problem designed to reach an impasse, and a base problem which
could be solved given the contents of CASIE’s memory. In each test CASIE was first
presented  with  the  target  problem.  Either  immediately  following  or  after  intermediate 
cases,  the  base  problem  was  presented.  Three  sets  of  scenarios  were  designed  to 
demonstrate  the  various  levels  of  insight.  One  set  involved  a  base  problem  which  would 
directly  overlap  a  portion  of  the  target  problem  structure.  Successful  diagnosis  of  the 
target problem when solving the base would demonstrate the spontaneous recollection of
an unsolved problem based on congruent surface attributes. In a second set of scenarios,
expansions of the entities from the base problem would coincide with those of the target
problem. Successful diagnosis of the target problem in this case would demonstrate that
the  inadvertent  activation  of  factually  related  networks  is  capable  of  triggering  insight.  A 
third  set  of  tests  involved  base  and  target  entities  with  similar  but  not  exact  matches,  such 
as  ‘Lung  Complication’  and  ‘Lung  Disease’.  Successful  diagnosis  in  this  case  would 
demonstrate  that  the  partial  activation  of  networks  can  trigger  insight. 

The  results  from  our  testbed  experiments  were  congruent  with  our  expectations. 
Successful  diagnosis  of  first  type  of  scenarios  demonstrated  that  through  direct  activation 
of  entities  from  an  unsolved  problem,  the  problem  could  be  recalled  through  meta- 
cognitive  processes  and  subsequently  reattempted  using  newly  learned  information.  We 
believe  this  to  be  a  weak  fonn  of  insight  as  it  utilizes  the  channels  of  intuition  and  meta¬ 
cognition  however  activation  of  the  exact  problem  components  is  required.  As  these  full 
activations  would  be  conscious,  it  is  predicted  that  the  solver  would  be  capable  of 
explaining  the  train  of  thought  that  led  to  the  moment  of  realization,  which  counters  our 
definition  of  insight.  The  second  set  of  scenarios  demonstrates  a  process  closer  to  our 
definition  of  insight,  in  which  the  unsolved  problem  is  recalled  through  automatic 
activation  of  related  entities.  In  these  cases,  such  activations  may  or  may  not  be 
conscious  depending  on  the  attention  of  meta-consciousness.  The  scenarios  from  this 
experiment represent cases that require a full activation of an entity's network. However,
the  ultimate  form  of  insight  has  been  described  as  only  requiring  partial  activation  of  a 
network  for  realization  to  occur.  This  form  was  represented  in  the  third  set  of  scenarios. 
As  expected,  CASIE  was  unable  to  solve  these  types  of  cases  due  to  the  limitation  of 
textual  memory.  Demonstration  of  this  type  of  insight  would  require  true  distributed 
memory. 

Conclusions 


It  is  hard  to  argue  that  the  human  brain  is  not  an  advanced  organ  of  extensive  capabilities. 
Most  are  fascinated  that  a  three-pound  mass  of  organic  material  is  able  to  compose 
artistic  masterpieces  and  develop  advanced  scientific  theories.  Even  the  simple  task  of 
deciding  what  to  eat  for  dinner  is  somewhat  intriguing.  The  brain  can  deal  with  a  wide 
range  of  tasks  using  various  methods.  Much  effort  has  gone  into  determining  how  the 
brain  is  able  to  solve  problems  at  particular  levels,  but  few  have  ventured  to  explain  all 
levels  of  problem  solving  through  the  use  of  common  resources  in  the  mind.  This  work 
has  attempted  this  task  through  the  presentation  of  a  theoretical  cognitive  architecture. 
Our  findings  demonstrate  that  all  levels  of  problem  solving  can  be  based  on  various 
levels  of  coordination  between  specialized  mechanisms  operating  in  parallel.  Rather  than 
a  result  of  search  speed,  extensive  abilities  can  result  from  elegance  of  storage,  automatic 
activation  of  concepts,  and  global  management. 

The  current  computational  version  of  the  CASIE  architecture  serves  to  demonstrate  the 
functionality  of  our  primary  theories.  However,  implementation  of  several  other 
functions  is  required  to  fully  exploit  the  power  of  the  architecture.  Future  efforts  could 
include  the  addition  of  learning  capability  through  both  inference  and  experience. 
Following this addition, CASIE will be able to internally manipulate the data stored in
Information.  Following  a  complex  diagnosis,  associations  would  be  created  between 
elements  of  the  problem  structure.  Such  associations  would  aid  in  the  future  diagnosis  of 
related  problems.  Other  additions  include  decision  making  capability.  This  would  allow 
CASIE  to  handle  more  realistic  scenarios  in  which  symptoms  associate  with  multiple 
diagnoses. Determining which to investigate first would depend on congruence with the
symptom  set  and  diagnosis  severity. 

Following  the  addition  of  learning  and  decision  making,  the  CASIE  architecture  would 
be  suitable  for  the  addition  of  advanced  abilities.  As  realized  throughout  development, 
higher-level  forms  of  problem  solving  and  creativity  are  highly  dependent  on  a  wealth  of 
interrelated  information  across  many  domains.  It  is  hypothesized  that  by  exposing 
CASIE  to  a  large  source  of  searchable  information,  the  creative  ability  and  occurrence  of 
insight would be significantly increased. Unfortunately, manually developing
a bounded knowledge base is not only laborious; it also defeats the purpose of
developing  a  system  able  to  apply  its  perceptions  to  stored  problems.  Thus,  CASIE 
would  require  a  module  allowing  it  to  acquire  information  as  easily  as  humans.  As  some 
readers  may  have  noticed,  the  structure  within  the  Information  component  strongly 
resembles  proposed  structures  of  the  semantic  web,  which  is  foreseen  as  the  next 
implementation of the internet. The semantic, or machine-searchable, web would allow
CASIE to gain factual information from a nearly infinite textual source. Currently,
CASIE  is  limited  to  gaining  information  through  a  human  user.  If  an  impasse  is  reached, 
the  system  cannot  seek  additional  information  as  a  real  person  is  able  to  do.  Using  a 
semantic  web  interface,  CASIE  would  be  capable  of  learning  new  symptoms,  diagnoses 
and  tests  as  well  as  their  relationships. 
Abstract

Existing graph partitioning approaches are mainly based on optimizing edge cuts
and do not take the distribution of edge weights (link distribution) into
consideration. In this paper, we propose a general model to partition graphs
based on link distributions. This model formulates graph partitioning under a
certain distribution assumption as approximating the graph affinity matrix
under the corresponding distortion measure. Under this model, we derive a novel
graph partitioning algorithm to approximate a graph affinity matrix under
various Bregman divergences, which correspond to a large exponential family of
distributions. We also establish the connections between edge cut objectives
and the proposed model to provide a unified view to graph partitioning.

Introduction 

Graph partitioning is an important problem in many machine learning
applications, such as circuit partitioning, VLSI design, task scheduling,
bioinformatics, and social network analysis. Existing graph partitioning
approaches are mainly based on edge cut objectives, such as the Kernighan-Lin
objective (Kernighan & Lin 1970), normalized cut (Shi & Malik 2000), ratio cut
(Chan, Schlag, & Zien 1993), ratio association (Shi & Malik 2000), and min-max
cut (Ding et al. 2001).

The main motivation of this study comes from the fact that graphs from
different applications may have very different statistical characteristics for
their edge weights. Specifically, the graphs may have very different link
distributions, where the link distribution refers to the distribution of edge
weights in a graph. For example, in a graph with binary weight edges, the link
distribution can be modeled as a Bernoulli distribution; in a graph with edges
of real-valued weights, the link distribution may be modeled as an exponential
distribution or a normal distribution. This fact naturally raises the following
questions: Is it appropriate to use edge cut objectives for all kinds of graphs
with different link distributions? If not, for what kinds of graphs do the edge
cut objectives work well? How can link distributions be used to partition
different types of graphs? This paper attempts to answer these questions.

Copyright © 2007, American Association for Artificial Intelligence.

Another motivation of this study is to derive an effective algorithm that
improves on some aspects of the existing graph partitioning algorithms. For
example, the popular spectral approaches involve expensive eigenvector
computation and extra post-processing on eigenvectors to obtain the
partitioning; the multi-level approaches such as METIS (Karypis & Kumar 1998)
restrict partitions to have an equal size.

In this paper, we first propose a general model to partition graphs based on
link distributions. The key idea is that by viewing the link distribution of a
graph as a mixture of link distributions within and between different
partitions, we can learn the mixture components to find the partitioning of the
graph. The model formulates partitioning a graph under a certain distribution
assumption as approximating the graph affinity matrix under the corresponding
distortion measure. Second, under this model, we derive a novel graph
partitioning algorithm to approximate a graph affinity matrix under various
Bregman divergences, which correspond to a large exponential family of
distributions. Our theoretical analysis and experiments demonstrate the
potential and effectiveness of the proposed model and algorithm. Third, we
establish the connections between the proposed model and the edge cut
objectives to provide a unified view to graph partitioning.

We use the following notation in this paper. Capital letters such as $A$, $B$
and $C$ denote matrices; $A_{ij}$ or $[A]_{ij}$ denotes the $(i,j)$th element
of $A$; small boldface letters such as $\mathbf{a}$, $\mathbf{b}$ and
$\mathbf{c}$ denote column vectors. A graph is denoted by $G = (V, E, A)$,
which is made up of a set of vertices $V$, a set of edges $E$, and the affinity
matrix $A$ of dimension $|V| \times |V|$, whose entries represent the weights
of the edges.

Related  Work 

Graph partitioning divides a graph into subgraphs by finding the best edge cuts
of the graph. Several edge cut objectives, such as the average cut (Chan,
Schlag, & Zien 1993), average association (Shi & Malik 2000), normalized cut
(Shi & Malik 2000), and min-max cut (Ding et al. 2001), have been proposed.
Various spectral algorithms have been developed for these objective functions
(Chan, Schlag, & Zien 1993; Shi & Malik 2000; Ding et al. 2001). These
algorithms use the eigenvectors of a graph affinity matrix, or a matrix derived
from the affinity matrix, to partition the graph.

Multilevel methods have been used extensively for graph partitioning with the
Kernighan-Lin objective, which attempts to minimize the cut in the graph while
maintaining equal-sized clusters (Bui & Jones 1993; Hendrickson & Leland;
Karypis & Kumar 1998). In multilevel algorithms, the graph is repeatedly
coarsened level by level until only a small number of nodes are left. Then, an
initial partitioning on this small graph is performed. Finally, the graph is
uncoarsened level by level, and at each level, the partitioning from the
previous level is refined using a refinement algorithm.

Recently, graph partitioning with an edge cut objective has been shown to be
mathematically equivalent to an appropriately weighted kernel k-means objective
function (Dhillon, Guan, & Kulis 2004; 2005). Based on this equivalence, the
weighted kernel k-means algorithm has been proposed for graph partitioning
(Dhillon, Guan, & Kulis 2004; 2005). Yu, Yu, & Tresp (2005) propose
graph-factorization clustering for graph partitioning, which seeks to construct
a bipartite graph to approximate a given graph. Long et al. (2006) propose a
framework of relation summary network to cluster K-partite graphs.

Another related field is unsupervised learning with Bregman divergences
(S.D.Pietra 2001; Wang & Schuurmans 2003). Banerjee et al. (2004b) generalize
the classic k-means to Bregman divergences. A generalized co-clustering
framework is presented by Banerjee et al. (2004a) wherein any Bregman
divergence can be used in the objective function.

Model  Formulation 

We first define the link distribution as follows.

Definition 1. Given a graph $G = (V, E, A)$, the link distribution
$f_{V_1 V_2}$ is the probability density of edge weights between nodes in
$V_1$ and $V_2$, where $V_1, V_2 \subseteq V$.

Based on Definition 1, the link distribution for the whole graph $G$ is
$f_{VV}$. The model assumption is that if $G$ has $k$ disjoint partitions
$V_1, \ldots, V_k$, then

$$f_{VV} = \sum_{1 \le i \le j \le k} \pi_{ij} f_{V_i V_j},$$

where $\pi_{ij}$ is the mixing probability such that
$\sum_{1 \le i \le j \le k} \pi_{ij} = 1$. Basically, the assumption states
that the link distribution of a graph is a mixture of the link distributions
within and between partitions. The intuition behind the assumption is that the
vertices within the same partition are related in a (statistically) similar way
to each other, and the vertices from different partitions are related to each
other in ways that differ from those within the same partition. In Section 5,
we show that the traditional edge cut objectives also implicitly make this
assumption under a normal distribution with extra constraints.

Consider an illustrative example. Figure 1(a) shows a graph of six vertices and
seven unit-weight edges. It is natural to partition the graph into two
components, $V_1 = \{v_1, v_2, v_3\}$ and $V_2 = \{v_4, v_5, v_6\}$. The link
distribution of the whole graph can be modeled as a Bernoulli distribution
$f_{VV}(x; \theta_{VV})$ with the parameter $\theta_{VV} = \frac{7}{15}$ (the
number of edges in the graph is 7 and the number of possible edges is 15).
Similarly, the link distributions for edges within and between $V_1$ and $V_2$
are Bernoulli distributions: $f_{V_1 V_1}(x; \theta_{V_1 V_1})$ with
$\theta_{V_1 V_1} = 1$, $f_{V_2 V_2}(x; \theta_{V_2 V_2})$ with
$\theta_{V_2 V_2} = 1$, and $f_{V_1 V_2}(x; \theta_{V_1 V_2})$ with
$\theta_{V_1 V_2} = \frac{1}{9}$. Note that $f_{VV}$ is a mixture of
$f_{V_1 V_1}$, $f_{V_2 V_2}$ and $f_{V_1 V_2}$, which can be verified by

$$\theta_{VV} = \tfrac{3}{15}\,\theta_{V_1 V_1} + \tfrac{9}{15}\,\theta_{V_1 V_2} + \tfrac{3}{15}\,\theta_{V_2 V_2}$$

(the mixing probability for $f_{V_1 V_2}$, $\frac{9}{15}$, follows from the
fact that the number of possible edges between $V_1$ and $V_2$ is 9; similarly
for the other mixing probabilities).

Figure 1: A graph with two partitions (a) and its graph affinity matrix (b).
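The mixture identity in this example can be checked with exact arithmetic. The sketch below is illustrative only: because Figure 1(a) is not reproduced here, the edge list is an assumption (a triangle inside each partition plus a single cross edge between vertices 3 and 4), chosen to match the stated count of seven edges. The helper name `density` is hypothetical.

```python
from fractions import Fraction
from itertools import combinations

# Assumed edge list: one triangle per partition plus one cross edge.
# The choice of (3, 4) as the cross edge is hypothetical.
edges = {(1, 2), (1, 3), (2, 3), (4, 5), (4, 6), (5, 6), (3, 4)}
V1, V2 = {1, 2, 3}, {4, 5, 6}

def density(pairs):
    """Return (fraction of candidate pairs that are edges, number of pairs)."""
    pairs = list(pairs)
    hits = sum(1 for p in pairs if tuple(sorted(p)) in edges)
    return Fraction(hits, len(pairs)), len(pairs)

theta_11, n11 = density(combinations(sorted(V1), 2))
theta_22, n22 = density(combinations(sorted(V2), 2))
theta_12, n12 = density((u, v) for u in V1 for v in V2)
theta_vv, n_all = density(combinations(sorted(V1 | V2), 2))

# Mixing probabilities are the shares of possible edges in each block.
mixture = (Fraction(n11, n_all) * theta_11
           + Fraction(n12, n_all) * theta_12
           + Fraction(n22, n_all) * theta_22)
print(theta_vv, theta_12, mixture)
```

Under these assumptions the whole-graph parameter and the weighted combination of the three block parameters coincide, which is what the mixture assumption requires for Bernoulli link distributions.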

Learning mixture components of the link distribution of a graph is much more
difficult than learning a traditional mixture model, since the graph structure
needs to be considered, i.e., our goal is to find the mixture components
associated with subgraphs and not just to simply draw the similar edges from
anywhere in the graph to form a component. For example, in Figure 1(a), without
considering the graph structure, the edge weights from two partitions $V_1$ and
$V_2$ cannot be separated. To tackle this difficulty, we model the problem
based on the graph affinity matrix, which contains all the information for a
graph.

Figure 1(b) shows the graph affinity matrix for the graph in Figure 1(a). We
observe that if the vertices within the same partition are arranged together,
the edge weights within and between partitions form the diagonal blocks and
off-diagonal blocks, respectively. Hence, learning the link distribution in a
graph is equivalent to learning different distributions for non-overlapping
blocks in the graph affinity matrix. To estimate the sufficient statistic for
each block, we need to solve the problem of likelihood maximization. It has
been shown that maximizing likelihood under a certain distribution corresponds
to minimizing distance under the corresponding distortion measure (Collins,
Dasgupta, & Reina 2001). For example, the normal distribution, Bernoulli
distribution, multinomial distribution and exponential distribution correspond
to Euclidean distance, logistic loss, KL-divergence and Itakura-Saito distance,
respectively. Therefore, learning the distributions of the blocks in a graph
affinity matrix can be formulated as approximating the affinity matrix under a
certain distortion measure. Formally, we define graph partitioning as the
following optimization problem of matrix approximation.

Definition 2. Given a graph $G = (V, E, A)$ where $A \in \mathbb{R}^{n \times n}$,
a distance function $\mathcal{D}$, and a positive integer $k$, the optimized
partitioning is given by the minimization

$$\min_{C \in \{0,1\}^{n \times k},\; B \in \mathbb{R}^{k \times k}} \mathcal{D}(A, CBC^T), \qquad (1)$$

where $C \in \{0,1\}^{n \times k}$ is an indicator matrix such that
$\sum_j C_{ij} = 1$, i.e., $C_{ij} = 1$ indicates that the $i$th vertex belongs
to the $j$th partition, and $\mathcal{D}$ is a separable distance function such
that $\mathcal{D}(X, Y) = \sum_{i,j} \mathcal{D}(X_{ij}, Y_{ij})$.
We call the model in Definition 2 Graph Partitioning with Link Distribution
(GPLD). GPLD provides not only the partitioning of the given graph, which is
denoted by the partition indicator matrix $C$, but also the partition
representative matrix $B$, which consists of the sufficient statistics for edge
weights within and between partitions. For example,
$B = \begin{bmatrix} 1 & \frac{1}{9} \\ \frac{1}{9} & 1 \end{bmatrix}$ for the
example in Figure 1(b). $B$ also provides an intuition about the quality of the
partitioning, since the larger the difference between the diagonal and the
off-diagonal elements, the better the partitions are separated. Note that GPLD
does not restrict $A$ to be symmetric or non-negative. Hence, it is possible to
apply GPLD to directed graphs or graphs with negative weights, though in this
paper our main focus is undirected graphs with non-negative weights.

Algorithm  Derivation 

First we derive an algorithm for the GPLD model based on the most popular
distance function, the Euclidean distance. Under the Euclidean distance, our
task is

$$\min_{C \in \{0,1\}^{n \times k},\; B \in \mathbb{R}^{k \times k}} \|A - CBC^T\|^2. \qquad (2)$$

We prove the following theorem, which is the basis of our algorithm.

Theorem 3. If $C \in \{0,1\}^{n \times k}$ and $B \in \mathbb{R}^{k \times k}$
are the optimal solution to the minimization in (2), then

$$B = (C^TC)^{-1} C^T A C (C^TC)^{-1}. \qquad (3)$$

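Proof. The objective function in (2) can be expanded as follows.

$$\begin{aligned}
L &= \|A - CBC^T\|^2 \\
  &= \mathrm{tr}\big((A - CBC^T)^T (A - CBC^T)\big) \\
  &= \mathrm{tr}(A^TA) - 2\,\mathrm{tr}(CBC^TA) + \mathrm{tr}(CBC^TCBC^T).
\end{aligned}$$

Taking the derivative with respect to $B$, we obtain

$$\frac{\partial L}{\partial B} = -2\,C^TAC + 2\,C^TCBC^TC. \qquad (4)$$

Solving $\frac{\partial L}{\partial B} = 0$ gives

$$B = (C^TC)^{-1} C^T A C (C^TC)^{-1}. \qquad (5)$$

This completes the proof of the theorem. □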
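The derivative step in the proof can be spelled out with standard trace identities. The following is a sketch under the assumption that $A$ and $B$ are symmetric (as for the undirected graphs that are the main focus here); the shorthand $M = C^TC$ is introduced only for this illustration.

```latex
\begin{align*}
\frac{\partial}{\partial B}\,\mathrm{tr}(CBC^{T}A)
  &= \frac{\partial}{\partial B}\,\mathrm{tr}(B\,C^{T}AC)
   = (C^{T}AC)^{T} = C^{T}AC, \\
\frac{\partial}{\partial B}\,\mathrm{tr}(CBC^{T}CBC^{T})
  &= \frac{\partial}{\partial B}\,\mathrm{tr}(BMBM)
   = 2\,MBM, \qquad M = C^{T}C, \\
\frac{\partial L}{\partial B}
  &= -2\,C^{T}AC + 2\,C^{T}C\,B\,C^{T}C = 0
  \;\Longrightarrow\;
  B = (C^{T}C)^{-1}C^{T}AC\,(C^{T}C)^{-1}.
\end{align*}
```

The first two lines use the cyclic property of the trace; the last line uses the fact that $C^TC$ is invertible whenever every partition is non-empty.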

Based on Theorem 3, we propose an alternating optimization algorithm, which
alternately updates $B$ and $C$ until convergence. We first fix $C$ and update
$B$. Eq. (3) in Theorem 3 provides an updating rule for $B$,

$$B = (C^TC)^{-1} C^T A C (C^TC)^{-1}. \qquad (6)$$


This updating rule can be implemented more efficiently than it appears. First,
it does not really involve computing inverse matrices, since $C^TC$ is a
special diagonal matrix with the size of each cluster on its diagonal such that
$[C^TC]_{pp} = |\pi_p|$, where $|\pi_p|$ denotes the size of the $p$th
partition; second, the product $C^TAC$ can be calculated without normal matrix
multiplication, since $C$ is an indicator matrix.
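A minimal sketch of this shortcut, assuming partitions are stored as an integer label vector rather than an explicit indicator matrix. The function name `update_B` and the label encoding are illustrative choices, not part of the original algorithm description.

```python
import numpy as np

def update_B(A, labels, k):
    """Block means of A under the partition `labels` (values in 0..k-1).

    Matches (C^T C)^{-1} C^T A C (C^T C)^{-1} for the one-hot indicator
    matrix C, assuming every partition is non-empty.
    """
    A = np.asarray(A, dtype=float)
    sizes = np.bincount(labels, minlength=k).astype(float)
    # C^T A: add up the rows of A that share a partition.
    row_sums = np.zeros((k, A.shape[1]))
    np.add.at(row_sums, labels, A)
    # (C^T A) C: add up the columns that share a partition.
    col_sums = np.zeros((k, k))
    np.add.at(col_sums, labels, row_sums.T)   # col_sums[q, p]
    block_sums = col_sums.T                   # block_sums[p, q]
    # C^T C = diag(sizes), so its inverse is an elementwise division.
    return block_sums / np.outer(sizes, sizes)

# Compare against the explicit matrix formula on a small random example.
rng = np.random.default_rng(0)
A = rng.random((6, 6))
labels = np.array([0, 0, 0, 1, 1, 1])
C = np.eye(2)[labels]
inv = np.linalg.inv(C.T @ C)
print(np.allclose(update_B(A, labels, 2), inv @ C.T @ A @ C @ inv))
```

Grouped sums avoid forming $C$ at all, which matters mostly when $n$ is large; for small graphs the explicit formula is just as practical.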

Then, we fix $B$ and update $C$. Since each row of $C$ is an indicator vector
with only one element equal to 1, we adopt the re-assignment procedure to
update $C$ row by row. To determine which element of the $h$th row of $C$ is
equal to 1, for $p = 1, \ldots, k$, each time we let $C_{hp} = 1$ and compute
the objective function $L = \|A - CBC^T\|^2$, which is denoted as $L_p$; then

$$C_{hp^*} = 1 \quad \text{for} \quad p^* = \arg\min_p L_p. \qquad (7)$$

Note that when we update the $h$th row of $C$, the necessary computation
involves only the $h$th row or column of $A$ and $CBC^T$.

Therefore,  updating  rules  (6)  and  (7)  provide  a  new  graph 
partitioning  algorithm,  GPLD  under  Euclidean  distance. 
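The two updating rules can be combined into a short alternating loop. The sketch below is a naive illustration for small graphs: it re-evaluates the full objective for every candidate assignment instead of touching only one row and column, and the function name `gpld_euclidean`, the stopping tolerance, and the handling of empty partitions are assumptions made for this example.

```python
import numpy as np

def gpld_euclidean(A, k, init_labels, max_iter=100):
    """Alternate rules (6) and (7) under squared Euclidean distance."""
    A = np.asarray(A, dtype=float)
    labels = np.array(init_labels)
    n = A.shape[0]

    def block_means(lab):
        C = np.eye(k)[lab]
        sizes = np.maximum(C.sum(axis=0), 1.0)  # guard against empty partitions
        return (C.T @ A @ C) / np.outer(sizes, sizes)

    def objective(lab, B):
        C = np.eye(k)[lab]
        return float(np.sum((A - C @ B @ C.T) ** 2))

    B = block_means(labels)
    history = [objective(labels, B)]
    for _ in range(max_iter):
        for h in range(n):                      # rule (7): row-by-row reassignment
            costs = []
            for p in range(k):
                trial = labels.copy()
                trial[h] = p
                costs.append(objective(trial, B))
            labels[h] = int(np.argmin(costs))
        B = block_means(labels)                 # rule (6): block means
        history.append(objective(labels, B))
        if history[-2] - history[-1] < 1e-12:   # no further improvement
            break
    return labels, B, history

# Two fully connected blocks of three vertices, with one vertex misassigned.
A_demo = np.kron(np.eye(2), np.ones((3, 3)))
labels, B, history = gpld_euclidean(A_demo, 2, [0, 0, 0, 1, 1, 0])
print(labels, history)
```

The recorded `history` makes it easy to confirm that the objective never increases from one sweep to the next.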

Presumably for a specific distance function used in Definition 2, we need to
derive a specific algorithm. However, a large number of useful distance
functions, such as Euclidean distance, generalized I-divergence, and KL
divergence, can be generalized as the Bregman divergences (S.D.Pietra 2001;
Banerjee et al. 2004b), which correspond to a large number of exponential
family distributions. Moreover, the nice properties of Bregman divergences make
it easy to generalize updating rules (6) and (7) to all Bregman divergences.
The definition of a Bregman divergence is given as follows.

Definition 4. Given a strictly convex function $\phi : S \mapsto \mathbb{R}$,
defined on a convex set $S \subseteq \mathbb{R}^d$ and differentiable on the
interior of $S$, $\mathrm{int}(S)$, the Bregman divergence
$D_\phi : S \times \mathrm{int}(S) \mapsto [0, \infty)$ is defined as

$$D_\phi(x, y) = \phi(x) - \phi(y) - (x - y)^T \nabla\phi(y), \qquad (8)$$

where $\nabla\phi$ is the gradient of $\phi$.
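Definition 4 translates directly into code. The sketch below builds a divergence from a convex function and its gradient and compares two instances against their closed forms; the factory name `bregman` and the sample vectors are illustrative assumptions.

```python
import numpy as np

def bregman(phi, grad_phi):
    """Return D_phi(x, y) = phi(x) - phi(y) - (x - y)^T grad_phi(y)."""
    def D(x, y):
        x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
        return float(phi(x) - phi(y) - np.dot(x - y, grad_phi(y)))
    return D

# phi(x) = ||x||^2 yields the squared Euclidean distance.
sq_euclid = bregman(lambda x: np.dot(x, x), lambda y: 2.0 * y)

# phi(x) = sum_i x_i log x_i yields the generalized I-divergence (x, y > 0).
gen_i = bregman(lambda x: np.sum(x * np.log(x)), lambda y: np.log(y) + 1.0)

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 2.0, 1.0])
print(sq_euclid(x, y), np.sum((x - y) ** 2))
print(gen_i(x, y), np.sum(x * np.log(x / y)) - np.sum(x - y))
```

Each printed pair should agree up to floating-point error, since the closed forms are algebraic simplifications of Eq. (8) for the chosen $\phi$.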

Table 1 shows a list of popular Bregman divergences and their corresponding
Bregman convex functions. The following theorem provides an important property
of Bregman divergences.

Theorem 5. Let $X$ be a random variable taking values in
$\mathcal{X} = \{x_i\}_{i=1}^{n} \subset S \subseteq \mathbb{R}^d$ following
$\nu$. Given a Bregman divergence
$D_\phi : S \times \mathrm{int}(S) \mapsto [0, \infty)$, the problem

$$\min_{s \in S} E_\nu[D_\phi(X, s)] \qquad (9)$$

has a unique minimizer given by $s^* = E_\nu[X]$.

The proof of Theorem 5 is omitted (see S.D.Pietra 2001; Banerjee et al.
2004b). Theorem 5 states that the Bregman representative of a random variable
is always the expectation of the variable. Hence, when given a sample of a
random variable, the optimal estimate of the Bregman representative is always
the mean of the sample. Under the GPLD model, $B_{pq}$ is the Bregman
representative of each block of an affinity matrix. When $C$ is given, i.e.,
the membership of each block is known, according to Theorem 5, $B_{pq}$ is
obtained as the mean of each block.
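Theorem 5 can be illustrated numerically for a divergence that is not symmetric. The sketch below uses the Itakura-Saito distance and a grid search over candidate representatives; the sample distribution, grid range, and grid resolution are arbitrary choices for illustration.

```python
import numpy as np

def itakura_saito(x, s):
    """Itakura-Saito distance, the Bregman divergence of phi(x) = -log x."""
    return x / s - np.log(x / s) - 1.0

rng = np.random.default_rng(0)
sample = rng.exponential(scale=2.0, size=1000)   # strictly positive values

# Average divergence from the sample to each candidate representative s.
mean = sample.mean()
candidates = np.linspace(0.5 * mean, 1.5 * mean, 2001)
expected = np.array([itakura_saito(sample, s).mean() for s in candidates])
best = candidates[np.argmin(expected)]
print(best, mean)
```

The grid minimizer lands on the sample mean up to the grid spacing, even though the Itakura-Saito distance is far from Euclidean, which is the property that lets one block-mean update serve every Bregman divergence.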
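Table 1: A list of Bregman divergences and the corresponding convex functions.

| Name | $D_\phi(x, y)$ | $\phi(x)$ | Domain |
|---|---|---|---|
| Euclidean distance | $\|x - y\|^2$ | $\|x\|^2$ | $\mathbb{R}^d$ |
| Generalized I-divergence | $\sum_{i=1}^{d} x_i \log(\frac{x_i}{y_i}) - \sum_{i=1}^{d} (x_i - y_i)$ | $\sum_{i=1}^{d} x_i \log(x_i)$ | $\mathbb{R}_+^d$ |
| Logistic loss | $x \log(\frac{x}{y}) + (1 - x) \log(\frac{1 - x}{1 - y})$ | $x \log(x) + (1 - x) \log(1 - x)$ | $\{0, 1\}$ |
| Itakura-Saito distance | $\frac{x}{y} - \log\frac{x}{y} - 1$ | $-\log x$ | $(0, \infty)$ |
| Hinge loss | $\max\{0, -2\,\mathrm{sign}(-y)\,x\}$ | $|x|$ | $\mathbb{R} \setminus \{0\}$ |
| KL-divergence | $\sum_{i=1}^{d} x_i \log(\frac{x_i}{y_i})$ | $\sum_{i=1}^{d} x_i \log(x_i)$ | $d$-Simplex |
| Mahalanobis distance | $(x - y)^T A (x - y)$ | $x^T A x$ | $\mathbb{R}^d$ |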

Algorithm 1 Graph Partitioning with Bregman Divergences

Input: A graph affinity matrix $A$, a Bregman divergence $D_\phi$, and a
positive integer $k$.

Output: A partition indicator matrix $C$ and a partition representative
matrix $B$.

Method:

    1: Initialize B.
    2: repeat
    3:     for h = 1 to n do
    4:         C_{hp*} = 1 for p* = arg min_p L_p, where L_p denotes
               D_phi(A, C B C^T) for C_{hp} = 1.
    5:     end for
    6:     B = (C^T C)^{-1} C^T A C (C^T C)^{-1}.
    7: until convergence


Explicitly,

$$B_{pq} = \frac{1}{|\pi_p|\,|\pi_q|} \sum_{i \in \pi_p,\; j \in \pi_q} A_{ij}, \qquad (10)$$

where $\pi_p$ and $\pi_q$ denote the $p$th and the $q$th cluster, respectively,
and $1 \le p \le k$, $1 \le q \le k$, $1 \le i \le n$ and $1 \le j \le n$. If
we write Eq. (10) in matrix form, we obtain Eq. (3), i.e., Theorem 3 is true
for all Bregman divergences. Hence, updating rule (6) is applicable to GPLD
with any Bregman divergence. For updating rule (7), there is only a minor
change for a given Bregman divergence, i.e., we calculate the objective
function $L$ based on this given Bregman divergence.

Therefore, we obtain a general graph partitioning algorithm, Graph Partitioning
with Bregman Divergences (GPBD), which is summarized in Algorithm 1. Unlike the
traditional graph partitioning approaches, this simple algorithm is capable of
partitioning graphs under different link distribution assumptions by adopting
different Bregman divergences. The computational complexity of GPBD can be
shown to be $O(tn^2k)$ for $t$ iterations. For a sparse graph, it is reduced to
$O(t|E|k)$. GPBD is faster than the popular spectral approaches, which involve
expensive eigenvector computation (typically $O(n^3)$) and extra
post-processing on eigenvectors to obtain the partitioning. Compared with the
multi-level approaches such as METIS (Karypis & Kumar 1998), GPBD does not
restrict partitions to have an equal size.

The convergence of Algorithm 1 is guaranteed based on the following facts.
First, based on Theorem 3 and Theorem 5, the objective function is
non-increasing under updating rule (6); second, by the criterion for
reassignment in updating rule (7), the objective function is also
non-increasing under updating rule (7). Since a Bregman divergence is
non-negative, the objective is bounded below, so the sequence of objective
values converges.


A  Unified  View  to  Graph  Partitioning 

In  this  section,  we  establish  the  connections  between  the 
GPLD  model  and  the  edge  cut  objectives  to  provide  a  unified 
view  for  graph  partitioning. 

In general, the edge cut objectives, such as ratio association (Shi & Malik
2000), ratio cut (Chan, Schlag, & Zien 1993), Kernighan-Lin objective
(Kernighan & Lin 1970), and normalized cut (Shi & Malik 2000), can be
formulated as the following trace maximization (Zha et al. 2002; Dhillon, Guan,
& Kulis 2004; 2005),

$$\max \mathrm{tr}(\tilde{C}^T A \tilde{C}). \qquad (11)$$

In (11), typically $\tilde{C}$ is a weighted indicator matrix such that

$$\tilde{C}_{ij} = \begin{cases} \dfrac{1}{|\pi_j|^{1/2}} & \text{if } v_i \in \pi_j \\ 0 & \text{otherwise,} \end{cases}$$

where $|\pi_j|$ denotes the number of nodes in the $j$th partition. In other
words, $\tilde{C}$ satisfies the constraints
$\tilde{C} \in \mathbb{R}_+^{n \times k}$ and $\tilde{C}^T\tilde{C} = I_k$,
where $I_k$ is the $k \times k$ identity matrix.

We propose the following theorem to show that the various edge cut objectives
are mathematically equivalent to a special case of the GPLD model. To be
consistent with the weighted indicator matrix used in edge cut objectives, in
the following theorem we modify the constraints on $C$ to
$C \in \mathbb{R}_+^{n \times k}$ and $C^TC = I_k$, which makes $C$ a weighted
indicator matrix.


Theorem 6. The GPLD model under Euclidean distance function and $B = rI_k$ for
$r > 0$, i.e.,

$$\min_{C \in \mathbb{R}_+^{n \times k},\; C^TC = I_k} \|A - C(rI_k)C^T\|^2 \qquad (12)$$

is equivalent to the maximization

$$\max \mathrm{tr}(C^TAC), \qquad (13)$$

where $\mathrm{tr}$ denotes the trace of a matrix.


Proof. Let $L$ denote the objective function in Eq. (12).

$$\begin{aligned}
L &= \|A - rCC^T\|^2 \\
  &= \mathrm{tr}\big((A - rCC^T)^T (A - rCC^T)\big) \\
  &= \mathrm{tr}(A^TA) - 2r\,\mathrm{tr}(CC^TA) + r^2\,\mathrm{tr}(CC^TCC^T) \\
  &= \mathrm{tr}(A^TA) - 2r\,\mathrm{tr}(C^TAC) + r^2 k.
\end{aligned}$$

The above deduction uses the trace property $\mathrm{tr}(XY) = \mathrm{tr}(YX)$.
Since $\mathrm{tr}(A^TA)$, $r$ and $k$ are constants, the minimization of $L$
is equivalent to the maximization of $\mathrm{tr}(C^TAC)$. The proof is
completed. □
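The identity used in this proof is easy to confirm numerically. The sketch below builds a weighted indicator matrix for an arbitrary partition and compares both sides; the graph size, the partition, and the value of `r` are arbitrary assumptions for illustration, and the affinity matrix is symmetrized as for an undirected graph.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k, r = 8, 3, 0.7
A = rng.random((n, n))
A = (A + A.T) / 2                      # symmetric affinity matrix

labels = np.array([0, 0, 0, 1, 1, 2, 2, 2])
sizes = np.bincount(labels, minlength=k)
# Weighted indicator: C[i, j] = |pi_j|^(-1/2) if vertex i is in partition j.
C = np.eye(k)[labels] / np.sqrt(sizes)

lhs = np.sum((A - r * (C @ C.T)) ** 2)
rhs = np.trace(A.T @ A) - 2 * r * np.trace(C.T @ A @ C) + r**2 * k
print(np.allclose(C.T @ C, np.eye(k)), np.isclose(lhs, rhs))
```

Because only the middle term depends on the partition, lowering the approximation error and raising $\mathrm{tr}(C^TAC)$ are the same task once $B$ is pinned to $rI_k$.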
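Table 2: Summary of the synthetic graphs

| Graph | Parameter | n | k | Distribution |
|---|---|---|---|---|
| syn1 | $\begin{bmatrix} 3 & 3 & 2.7 \\ 3 & 2.7 & 2.7 \\ 2.7 & 2.7 & 3 \end{bmatrix}$ | 300 | 3 | Normal |
| syn2 | $\begin{bmatrix} 6.9 & 7 & 6.3 \\ 7 & 6.3 & 6.3 \\ 6.3 & 6.3 & 7 \end{bmatrix}$ | 600 | 3 | Poisson |
| syn3 | $20 \times 20$ | 20000 | 20 | Normal |
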

Theorem 6 states that with the partition representative matrix $B$ restricted
to be of the form $rI_k$, the GPLD model under Euclidean distance is reduced to
the trace maximization in (13). Since various edge cut objectives can be
formulated as the trace maximization, Theorem 6 establishes the connection
between the GPLD model and the existing edge cut objective functions.

Based on this connection, edge cut objectives make two implicit assumptions
about a graph's link distribution. First, Euclidean distance in Theorem 6
implies a normal distribution assumption for the edge weights in a graph.
Second, since the off-diagonal entries in $B$ represent the mean edge weights
between partitions and the diagonal elements of $B$ represent the mean edge
weights within partitions, restricting $B$ to be of the form $rI_k$ for $r > 0$
implies that the edges between partitions are very sparse (mean weight close to
0) and the edge weights within partitions have the same positive expectation
$r$. However, these two assumptions are not appropriate for graphs whose link
distributions deviate from the normal distribution or for dense graphs.
Therefore, compared with the edge cut based approaches, the GPBD algorithm is
more flexible in dealing with graphs with different statistical
characteristics.

Experimental  Results 

Although GPBD actually provides a family of algorithms under various Bregman
divergences, due to the space limit, in this paper we present the experimental
evaluation of the effectiveness of the GPBD algorithm under the two most
popular divergences: GPBD under Euclidean Distance (GPBD-ED), corresponding to
the normal distribution, and GPBD under Generalized I-divergence (GPBD-GI),
corresponding to the Poisson distribution. These are compared with two
representative graph partitioning algorithms, Normalized Cut (NC) (Shi & Malik
2000; Ng, Jordan, & Weiss 2001) and METIS (Karypis & Kumar 1998).

We use synthetic data to simulate graphs whose edge weights follow normal and
Poisson distributions. The distribution parameters used to generate the graphs
are listed in the second column of Table 2 as matrices. In a parameter matrix
$P$, $P_{ij}$ denotes the distribution parameter that generates the edge
weights between the nodes in the $i$th partition and the nodes in the $j$th
partition. Graph syn3 has twenty partitions, 20000 nodes, and about 10 million
edges. Due to the space limit, its distribution parameters are omitted here.
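A synthetic graph of this kind could be generated as sketched below. This is not the authors' generator: the function name `synthetic_graph`, the equal partition sizes, the unit standard deviation for the normal case, and the zero diagonal are all assumptions, since the text specifies only the parameter matrices. The example parameter matrix is the syn1 entry as listed in Table 2.

```python
import numpy as np

def synthetic_graph(P, sizes, distribution="normal", sigma=1.0, seed=0):
    """Sample a symmetric affinity matrix whose (p, q) block uses P[p, q].

    `sizes` (partition sizes) and `sigma` (standard deviation in the
    normal case) are assumed values, not taken from the source.
    """
    rng = np.random.default_rng(seed)
    P = np.asarray(P, dtype=float)
    labels = np.repeat(np.arange(len(sizes)), sizes)
    params = P[np.ix_(labels, labels)]        # per-edge distribution parameter
    if distribution == "normal":
        W = rng.normal(loc=params, scale=sigma)
    elif distribution == "poisson":
        W = rng.poisson(lam=params).astype(float)
    else:
        raise ValueError(f"unknown distribution: {distribution}")
    upper = np.triu(W, k=1)                   # keep one draw per vertex pair
    return upper + upper.T, labels            # symmetric, zero diagonal

P_syn1 = [[3, 3, 2.7], [3, 2.7, 2.7], [2.7, 2.7, 3]]
A_syn, labels_syn = synthetic_graph(P_syn1, sizes=[100, 100, 100])
print(A_syn.shape, np.bincount(labels_syn))
```

Passing `distribution="poisson"` with the syn2 parameters would give a count-valued graph of the second kind described above.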

Graphs based on text data have been widely used
to test graph partitioning algorithms (Ding et al. 2001;
Dhillon 2001; Zha et al. 2001). In this study, we construct
real graphs based on various data sets from the 20-newsgroups
(Lang 1995) data, which contains about 20,000 articles from
the 20 news groups and can be used to generate data sets of
different sizes, balances and difficulty levels. We pre-process
the data by removing stop words and file headers and selecting
the top 2000 words by mutual information. Each document is
represented by a term-frequency vector using TF-IDF weights,
and the cosine similarity is adopted for the edge weight.
Specific details of the data sets are listed in Table 3. For
example, the third row of Table 3 shows that three data sets
NG5-1, NG5-2 and NG5-3 are generated by sampling from five
newsgroups with size 900, 1200 and 1450, respectively, and
with balance 1.5, 2.5, and 4, respectively. Here balance
denotes the ratio of the largest partition size to the
smallest partition size in a graph. Normalized Mutual
Information (NMI) (Strehl & Ghosh 2002), a standard measure
of cluster quality, is used as the performance measure. The
final performance score is the average of twenty runs.
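The NMI score used above can be sketched as follows. This is a minimal illustration, assuming the geometric-mean normalization I(X;Y) / sqrt(H(X) H(Y)) of Strehl & Ghosh; the function name `nmi` and the convention of returning 0 for a single-cluster labeling are choices made here, not details taken from the paper.

```python
import numpy as np


def nmi(labels_true, labels_pred):
    """Normalized mutual information I(X;Y) / sqrt(H(X) * H(Y))."""
    x = np.asarray(labels_true).ravel()
    y = np.asarray(labels_pred).ravel()
    # Map arbitrary labels to 0..m-1 and build the contingency table.
    _, xi = np.unique(x, return_inverse=True)
    _, yi = np.unique(y, return_inverse=True)
    counts = np.zeros((xi.max() + 1, yi.max() + 1))
    np.add.at(counts, (xi, yi), 1)

    pxy = counts / x.size            # joint distribution of the two labelings
    px = pxy.sum(axis=1)             # marginal of the true labels
    py = pxy.sum(axis=0)             # marginal of the predicted labels
    nz = pxy > 0                     # skip empty cells (0 * log 0 = 0)
    mi = np.sum(pxy[nz] * np.log(pxy[nz] / np.outer(px, py)[nz]))
    hx = -np.sum(px * np.log(px))
    hy = -np.sum(py * np.log(py))
    if hx == 0.0 or hy == 0.0:
        return 0.0                   # assumed convention for a one-cluster labeling
    return float(mi / np.sqrt(hx * hy))


# NMI ignores label names: a relabeled copy of the same partition scores 1
# (up to rounding), while an independent partition scores 0.
print(nmi([0, 0, 1, 1], [1, 1, 0, 0]))
print(nmi([0, 0, 1, 1], [0, 1, 0, 1]))
```

Averaging this score over repeated runs of a partitioning algorithm gives the kind of summary reported in the experiments.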

Table 4 shows the NMI scores of the four algorithms. For
the synthetic data syn1 and syn3 with normal link distribution,
the GPBD-ED algorithm, which assumes a normal distribution
for the links, provides the best NMI score. Similarly,
for data syn2 with Poisson link distribution, the GPBD-GI
algorithm, which assumes a Poisson distribution for the links,
provides the best performance.

For real graphs, we observe that GPBD-GI provides the best
NMI scores for all the graphs and performs significantly better
than NC and METIS on most graphs. This suggests that the
link distributions of these graphs are closer to the Poisson
distribution than to the normal distribution. How to determine
the appropriate link distribution assumption for a given graph
is beyond the scope of this paper. However, the results show
that an appropriate link distribution assumption (an appropriate
distance function for GPBD) leads to a significant improvement
in partitioning quality. For example, for the graph NG2-3,
even though NC totally fails and the other algorithms perform
poorly, GPBD-GI still provides satisfactory performance. We
observe that all the algorithms perform poorly on NG10. One
possible reason is that in NG10 some partitions overlap
heavily and are very unbalanced. We also observe that the
performance of GPBD with the appropriate distribution is more
robust to unbalanced graphs. For example, from NG2-1 to NG2-3,
the performance of GPBD-GI decreases much less than that of
NC and METIS. One possible reason for the deterioration of
METIS on unbalanced graphs is that it restricts partitions to
have equal size.

Table 3: Subsets of Newsgroup Data for constructing graphs.

Name        Newsgroups Included                               # Documents     Balance
NG2-1/2/3   alt.atheism, comp.graphics                        330/525/750     1.2/2.5/4
NG3-1/2/3   comp.graphics, rec.sport.hockey,                  480/675/900     1.2/2.5/4
            talk.religion.misc
NG5-1/2/3   comp.os.ms-windows.misc, comp.windows.x,          900/1200/1450   1.5/2.5/4
            rec.motorcycles, sci.crypt, sci.space
NG10        comp.graphics, comp.sys.ibm.pc.hardware,          5600            7
            rec.autos, rec.sport.baseball, sci.crypt,
            sci.med, comp.windows.x, soc.religion.christian,
            talk.politics.mideast, talk.religion.misc

Table 4: NMI scores of the four algorithms.

Data     NC                METIS            GPBD-ED          GPBD-GI
syn1     0.673 ± 0.081     0.538 ± 0.016    0.915 ± 0.017    0.893 ± 0.072
syn2     0.648 ± 0.052     0.533 ± 0.018    0.828 ± 0.139    0.863 ± 0.111
syn3     0.801 ± 0.029     0.799 ± 0.010    0.933 ± 0.047    0.811 ± 0.055
NG2-1    0.482 ± 0.299     0.759 ± 0.024    0.678 ± 0.155    0.824 ± 0.045
NG2-2    0.047 ± 0.041     0.400 ± 0.000    0.283 ± 0.029    0.579 ± 0.073
NG2-3    0.042 ± 0.023     0.278 ± 0.000    0.194 ± 0.008    0.356 ± 0.027
NG3-1    0.806 ± 0.108     0.810 ± 0.017    0.718 ± 0.128    0.852 ± 0.081
NG3-2    0.185 ± 0.116     0.501 ± 0.012    0.371 ± 0.131    0.727 ± 0.070
NG3-3    0.048 ± 0.013     0.546 ± 0.016    0.235 ± 0.091    0.631 ± 0.179
NG5-1    0.598 ± 0.077     0.616 ± 0.032    0.550 ± 0.043    0.662 ± 0.025
NG5-2    0.5612 ± 0.030    0.570 ± 0.020    0.546 ± 0.032    0.670 ± 0.022
NG5-3    0.426 ± 0.060     0.574 ± 0.018    0.515 ± 0.033    0.668 ± 0.035
NG10     0.281 ± 0.011     0.310 ± 0.017    0.308 ± 0.015    0.335 ± 0.009

Conclusion

In this paper, we propose a general model to partition graphs
based on link distribution. This model formulates graph
partitioning under a certain distribution assumption as
approximating the graph affinity matrix under the corresponding
distortion measure. Under this model, we derive a novel
graph partitioning algorithm to approximate a graph affinity
matrix under various Bregman divergences, which correspond
to a large exponential family of distributions. Our
theoretical analysis and experiments demonstrate the potential
and effectiveness of the proposed model and algorithm. We
also show the connections between the traditional edge cut
objectives and the proposed model to provide a unified view
of graph partitioning.
Abstract

Relational data appear frequently in many machine
learning applications. Relational data consist of the
pairwise relations (similarities or dissimilarities)
between each pair of implicit objects; they are usually
stored in relation matrices, and typically no other
knowledge is available. Although relational clustering
can be formulated as graph partitioning in some
applications, this formulation is not adequate for
general relational data. In this paper, we propose a
general model for relational clustering based on
symmetric convex coding. The model is applicable to all
types of relational data and unifies the existing graph
partitioning formulation. Under this model, we derive
two alternative bound optimization algorithms to solve
the symmetric convex coding under two popular distance
functions, Euclidean distance and generalized
I-divergence. Experimental evaluation and theoretical
analysis show the effectiveness and great potential of
the proposed model and algorithms.

1.  Introduction 

Two types of data are used in unsupervised learning: feature
data and relational data. Feature data are in the form of
feature vectors, whereas relational data consist of the pairwise
relations (similarities or dissimilarities) between each pair
of objects; they are usually stored in relation matrices, and
typically no other knowledge is available. Although feature
data are the most common type of data, relational data have
become more and more popular in many machine learning
applications, such as web mining, social network analysis,
bioinformatics, VLSI design, and task scheduling. Furthermore,
relational data are more general in the sense that all
feature data can be transformed into relational data under
a certain distance function.

The most popular way to cluster similarity-based relational
data is to formulate it as the graph partitioning problem,
which has been studied for decades. Graph partitioning
seeks to cut a given graph into disjoint subgraphs which
correspond to disjoint clusters based on a certain edge cut
objective. Recently, graph partitioning with an edge cut
objective has been shown to be mathematically equivalent to
an appropriate weighted kernel k-means objective function
(Dhillon et al., 2004; Dhillon et al., 2005). The assumption
behind the graph partitioning formulation is that since
the nodes within a cluster are similar to each other, they
form a dense subgraph. However, in general this is not true
for relational data, i.e., the clusters in relational data are not
necessarily dense clusters consisting of strongly-related
objects.

Figure 1 shows the relational data of four clusters,
which are of two different types. In Figure 1,
$C_1 = \{v_1, v_2, v_3, v_4\}$ and $C_2 = \{v_5, v_6, v_7, v_8\}$
are two traditional dense clusters within which objects are
strongly related to each other. However,
$C_3 = \{v_9, v_{10}, v_{11}, v_{12}\}$ and
$C_4 = \{v_{13}, v_{14}, v_{15}, v_{16}\}$ also form two sparse
clusters, within which the objects are not related to each other,
but they are still "similar" to each other in the sense that they
are related to the same set of other nodes. In Web mining,
this type of cluster could be a group of music "fans"
Web pages which share the same taste in music and
are linked to the same set of music Web pages but are not
linked to each other (Kumar et al., 1999). Due to the
importance of identifying this type of cluster (community), it
has been listed as one of the five algorithmic challenges in
Web search engines (Henzinger et al., 2003). Note that the
cluster structure of the relational data in Figure 1 cannot be
correctly identified by graph partitioning approaches, since
they look for only dense clusters of strongly related objects
by cutting the given graph into subgraphs; similarly, the
pure bi-partite graph models cannot correctly identify this
type of cluster structure. Note that re-defining the
relations between the objects does not solve the problem in this
situation, since there exist both dense and sparse clusters.

If the relational data are dissimilarity-based, applying graph
partitioning approaches to them requires extra effort to
appropriately transform them into similarity-based data
and to ensure that the transformation does not change the
cluster structures in the data. Hence, it is desirable for
an algorithm to be able to identify the cluster structures
no matter which type of relational data is given. This is
even more desirable in the situation where the background
knowledge about the meaning of the relations is not
available, i.e., we are given only a relation matrix and do not
know if the relations are similarities or dissimilarities.

In this paper, we propose a general model for relational
clustering based on symmetric convex coding of the
relation matrix. The proposed model is applicable to
general relational data consisting of only pairwise relations,
typically without other knowledge; it is capable of learning
both dense and sparse clusters at the same time; and it unifies
the existing graph partitioning models to provide a generalized
theoretical foundation for relational clustering. Under this
model, we derive iterative bound optimization algorithms
to solve the symmetric convex coding for two important
distance functions, Euclidean distance and generalized
I-divergence. The algorithms are applicable to general
relational data and at the same time they can be easily adapted
to learn a specific type of cluster structure. For example,
when applied to learning only dense clusters, they provide
new efficient algorithms for graph partitioning. The
convergence of the algorithms is theoretically guaranteed.
Experimental evaluation and theoretical analysis show the
effectiveness and great potential of the proposed model and
algorithms.

2.  Related  Work 

Graph partitioning (or clustering) is a popular formulation
of relational clustering, which divides the nodes of a graph
into clusters by finding the best edge cuts of the graph.
Several edge cut objectives, such as the average cut (Chan
et al., 1993), average association (Shi & Malik, 2000),
normalized cut (Shi & Malik, 2000), and min-max cut (Ding
et al., 2001), have been proposed. Various spectral
algorithms have been developed for these objective functions
(Chan et al., 1993; Shi & Malik, 2000; Ding et al., 2001).
These algorithms use the eigenvectors of a graph affinity
matrix, or a matrix derived from the affinity matrix, to
partition the graph.

Multilevel methods have been used extensively for graph
partitioning with the Kernighan-Lin objective, which
attempts to minimize the cut in the graph while maintaining
equal-sized clusters (Bui & Jones, 1993; Hendrickson &
Leland, 1995; Karypis & Kumar, 1998).

Figure 1. The graph (a) and relation matrix (b) of the relational
data with different types of clusters. In (b), the dark color denotes
1 and the light color denotes 0.

Recently, graph partitioning with an edge cut objective
has been shown to be mathematically equivalent to an
appropriate weighted kernel k-means objective function
(Dhillon et al., 2004; Dhillon et al., 2005). Based on this
equivalence, the weighted kernel k-means algorithm has
been proposed for graph partitioning (Dhillon et al., 2004;
Dhillon et al., 2005). Yu et al. (2005) propose
graph-factorization clustering for graph partitioning, which
seeks to construct a bipartite graph to approximate a given
graph. Nasraoui et al. (1999) propose the relational fuzzy
maximal density estimator algorithm.

In this paper, our focus is on homogeneous relational
data, i.e., the objects in the data are of the same type. There
are some efforts in the literature that can be considered
as clustering heterogeneous relational data, i.e., different
types of objects are related to each other. For example,
co-clustering addresses clustering two types of related objects,
such as documents and words, at the same time. Dhillon
et al. (2003) propose a co-clustering algorithm to maximize
the mutual information. A more generalized co-clustering
framework is presented by Banerjee et al. (2004) wherein
any Bregman divergence can be used in the objective
function. Long et al. (2005), Li (2005) and Ding et al. (2006)
all model co-clustering as an optimization problem
involving a triple matrix factorization.

3.  Symmetric  Convex  Coding 

In this section, we propose a general model for relational
clustering. Let us first consider the relational data in
Figure 1. An interesting observation is that although the
different types of clusters look so different in the graph from
Figure 1(a), they all demonstrate block patterns in the
relation matrix of Figure 1(b) (without loss of generality, we
arrange the objects from the same cluster together to make
the block patterns explicit). Motivated by this observation,
we propose the Symmetric Convex Coding (SCC) model
to cluster relational data by learning the block pattern of
a relation matrix. Since in most applications the relations
are of non-negative values and undirected, relational data
can be represented as non-negative, symmetric matrices.
Therefore, the definition of the SCC is given as follows.

Definition 3.1. Given a symmetric matrix $A \in \mathbb{R}_+^{n \times n}$,
a distance function $D$ and a positive number $k$, the symmetric
convex coding is given by the minimization

\[
\min_{C \in \mathbb{R}_+^{n \times k},\; B \in \mathbb{R}_+^{k \times k},\; C\mathbf{1} = \mathbf{1}} D(A, CBC^T). \tag{1}
\]


According to Definition 3.1, the elements of C are between
0 and 1 and the sum of the elements in each row of C equals
1. Therefore, SCC seeks to use the convex combination
of the prototype matrix B to approximate the original
relation matrix. The factors from SCC have intuitive
interpretations. The factor C is the soft membership matrix such
that $C_{ij}$ denotes the weight with which the ith object is
associated with the jth cluster. The factor B is the prototype
matrix such that $B_{ii}$ denotes the connectivity within the ith
cluster and $B_{ij}$ denotes the connectivity between the ith cluster
and the jth cluster.
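The roles of the two factors can be illustrated on a small hypothetical example. The sketch below is not taken from the paper: the six objects, the hard memberships in `C`, and the prototype values in `B` are made up to show how one dense cluster and two sparse, mutually linked clusters both appear as blocks in $CBC^T$.

```python
import numpy as np

# Hypothetical toy data: 6 objects, 3 clusters, hard memberships.
# Each row of C sums to 1, as required by the constraint C1 = 1.
C = np.array([[1, 0, 0],
              [1, 0, 0],
              [0, 1, 0],
              [0, 1, 0],
              [0, 0, 1],
              [0, 0, 1]], dtype=float)

# Prototype matrix: cluster 0 is dense (B[0, 0] = 1); clusters 1 and 2 are
# sparse inside (zero diagonal) but related to each other (B[1, 2] = 1).
B = np.array([[1, 0, 0],
              [0, 0, 1],
              [0, 1, 0]], dtype=float)

A_hat = C @ B @ C.T   # block-structured approximation of the relation matrix
print(A_hat)

# A soft membership row is a convex combination of prototype rows:
# an object half in cluster 0 and half in cluster 1.
c_soft = np.array([0.5, 0.5, 0.0])
print(c_soft @ B @ C.T)   # its predicted relations to the six objects
```

With soft memberships the same product gives graded relation values, which is what the convex-combination reading of the model refers to.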

SCC provides a general model to learn various cluster
structures from relational data. Graph partitioning, which
focuses on learning dense cluster structure, can be
formulated as a special case of the SCC model. We propose the
following theorem to show that the various graph partitioning
objective functions are mathematically equivalent to a
special case of the SCC model. Since most graph
partitioning objective functions are based on hard cluster
membership, in the following theorem we change the
constraints on C to $C \in \mathbb{R}_+^{n \times k}$ and $C^T C = I_k$,
which make C the following cluster indicator matrix,

\[
C_{ij} =
\begin{cases}
\dfrac{1}{|\pi_j|^{1/2}} & \text{if } v_i \in \pi_j \\
0 & \text{otherwise}
\end{cases}
\]

where $|\pi_j|$ denotes the number of nodes in the jth cluster.

Theorem 3.2. The hard version of the SCC model under the
Euclidean distance function and $B = rI_k$ for $r > 0$, i.e.,

\[
\min_{C \in \mathbb{R}_+^{n \times k},\; C^T C = I_k} \|A - C(rI_k)C^T\|^2 \tag{2}
\]

is equivalent to the maximization

\[
\max \operatorname{tr}(C^T A C), \tag{3}
\]

where tr denotes the trace of a matrix.

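Proof. Let L denote the objective function in Eq. 2.

\begin{align}
L &= \|A - rCC^T\|^2 \tag{4} \\
  &= \operatorname{tr}((A - rCC^T)^T (A - rCC^T)) \tag{5} \\
  &= \operatorname{tr}(A^T A) - 2r \operatorname{tr}(CC^T A) + r^2 \operatorname{tr}(CC^T CC^T) \tag{6} \\
  &= \operatorname{tr}(A^T A) - 2r \operatorname{tr}(C^T A C) + r^2 k \tag{7}
\end{align}

The above deduction uses the property of trace $\operatorname{tr}(XY) = \operatorname{tr}(YX)$.
Since $\operatorname{tr}(A^T A)$, r and k are constants, the
minimization of L is equivalent to the maximization of
$\operatorname{tr}(C^T A C)$. The proof is completed. □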
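The identity in Eqs. (4) to (7) is easy to check numerically. The following sketch uses a random symmetric matrix and two arbitrary hard partitions; the sizes, the value of `r`, and all function names are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, r = 8, 3, 0.7
A = rng.random((n, n))
A = (A + A.T) / 2                      # symmetric non-negative relation matrix


def indicator(labels, k):
    """Cluster indicator matrix with entries 1/sqrt(|pi_j|), so C^T C = I_k."""
    C = np.zeros((labels.size, k))
    C[np.arange(labels.size), labels] = 1.0
    return C / np.sqrt(C.sum(axis=0))  # assumes no cluster is empty


def scc_loss(A, C, r):
    """Left-hand side: ||A - r C C^T||^2 (Eq. 4)."""
    return np.linalg.norm(A - r * C @ C.T, "fro") ** 2


def trace_form(A, C, r):
    """Right-hand side: tr(A^T A) - 2 r tr(C^T A C) + r^2 k (Eq. 7)."""
    return np.trace(A.T @ A) - 2 * r * np.trace(C.T @ A @ C) + r**2 * C.shape[1]


C1 = indicator(np.array([0, 0, 0, 1, 1, 2, 2, 2]), k)
C2 = indicator(np.array([0, 1, 2, 0, 1, 2, 0, 1]), k)

print(np.isclose(scc_loss(A, C1, r), trace_form(A, C1, r)))
# The loss difference between two partitions is -2r times the trace
# difference, so a smaller loss always means a larger trace.
print(scc_loss(A, C1, r) - scc_loss(A, C2, r),
      -2 * r * (np.trace(C1.T @ A @ C1) - np.trace(C2.T @ A @ C2)))
```

Because the two printed differences agree, ranking partitions by the SCC loss and by the trace objective gives the same order for any fixed r > 0.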


Theorem 3.2 states that with the prototype matrix B
restricted to be of the form $rI_k$, SCC under Euclidean
distance is reduced to the trace maximization in (3). Since
various graph partitioning objectives, such as ratio association
(Shi & Malik, 2000), normalized cut (Shi & Malik, 2000),
ratio cut (Chan et al., 1993), and the Kernighan-Lin objective
(Kernighan & Lin, 1970), can be formulated as the trace
maximization (Dhillon et al., 2004; Dhillon et al., 2005),
Theorem 3.2 establishes the connection between the SCC
model and the existing graph partitioning objective
functions. Based on this connection, it is clear that the existing
graph partitioning models make an implicit assumption about
the cluster structure of the relational data, i.e., the clusters
are not related to each other (the off-diagonal elements of
B are zeroes) and the nodes within clusters are related to
each other in the same way (the diagonal elements of B are
r). This assumption is consistent with the intuition about
graph partitioning, which seeks to "cut" the graph into
k separate subgraphs corresponding to the strongly-related
clusters.

With Theorem 3.2 we may put other types of structural
constraints on B to derive new graph partitioning models. For
example, we may restrict B to be a general diagonal matrix
instead of $rI_k$, i.e., the model fixes the off-diagonal elements
of B as zero and learns the diagonal elements of B. This is a
more flexible graph partitioning model, since it allows the
connectivity within different clusters to be different. More
generally, we can use B to restrict the model to learn other
types of cluster structures. For example, by fixing the diagonal
elements of B as zeros, the model focuses on learning
only sparse clusters (corresponding to bi-partite or k-partite
subgraphs), which are important for Web community learning
(Kumar et al., 1999; Henzinger et al., 2003). In summary,
the prototype matrix B not only provides the intuition
for the cluster structure of the data, but also provides
a simple way to adapt the model to learn specific types of
cluster structures.


4.  Algorithm  Derivation 

In this section, we derive efficient algorithms for the SCC
model under two popular distance functions, Euclidean
distance and generalized I-divergence.

4.1. Algorithm for SCC under Euclidean Distance

We derive an alternating optimization algorithm for SCC
under Euclidean distance, i.e., the algorithm alternately
updates B and C until convergence.

First we fix B to update C. To deal with the constraint
$C\mathbf{1} = \mathbf{1}$ efficiently, we transform it to a "soft" constraint
by adding a penalty term, $\alpha\|C\mathbf{1} - \mathbf{1}\|^2$, to the objective
function, where $\alpha$ is a positive constant. Therefore, we
obtain the following optimization.

\[
\min_{C \in \mathbb{R}_+^{n \times k}} \|A - CBC^T\|^2 + \alpha\|C\mathbf{1} - \mathbf{1}\|^2. \tag{8}
\]
The objective function in (8) is quartic with respect to C.
We derive an efficient updating rule for C based on the
bound optimization procedure (Salakhutdinov & Roweis,
2003; Lee & Seung, 1999). The basic idea is
to construct an auxiliary function which is a convex upper
bound for the original objective function based on the
solution obtained from the previous iteration. Then, a new
solution to the current iteration is obtained by minimizing this
upper bound. The definition of the auxiliary function and a
useful lemma (Lee & Seung, 1999) are quoted as
follows.

Definition 4.1. $G(S, S^t)$ is an auxiliary function for $F(S)$
if $G(S, S^t) \ge F(S)$ and $G(S, S) = F(S)$.

Lemma 4.2. If G is an auxiliary function, then F
is non-increasing under the updating rule
$S^{t+1} = \arg\min_S G(S, S^t)$.
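A scalar toy problem makes the bound optimization idea concrete. The sketch below is an illustration constructed here, not an example from the paper: it assumes the objective $F(x) = (a - x^2)^2$ for $x > 0$ and the auxiliary function $G(x, \tilde{x}) = a^2 - 2a\tilde{x}^2(1 + 2\log x - 2\log\tilde{x}) + x^4$, which upper-bounds F because $y \ge 1 + \log y$. Minimizing G over x gives the multiplicative step $x = \tilde{x}\,(a/\tilde{x}^2)^{1/4}$.

```python
import math


def F(x, a):
    """Toy quartic objective (a - x^2)^2, assumed for illustration."""
    return (a - x * x) ** 2


def G(x, x_prev, a):
    """Auxiliary function: G(x, x_prev) >= F(x) and G(x, x) = F(x)."""
    return a * a - 2 * a * x_prev**2 * (1 + 2 * math.log(x / x_prev)) + x**4


def mm_step(x_prev, a):
    """argmin_x G(x, x_prev): solves -4 a x_prev^2 / x + 4 x^3 = 0."""
    return (a * x_prev**2) ** 0.25


a, x = 4.0, 1.0
values = [F(x, a)]
for _ in range(60):
    x = mm_step(x, a)
    values.append(F(x, a))

print(x)                # approaches sqrt(a)
print(values[:4])       # the objective never increases, as Lemma 4.2 states
```

The same pattern, with logarithmic lower bounds on the negative terms and power upper bounds on the positive terms, is what produces multiplicative updates in the matrix case.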


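We propose an auxiliary function for C in the following
lemma.

Lemma 4.3.

\begin{align*}
G(C, \tilde{C}) = \sum_{ij} \Big( & A_{ij}^2 + \frac{\alpha}{n}
 - 2 \sum_{gh} \Big( A_{ij} \tilde{C}_{ig} B_{gh} \tilde{C}_{jh} (1 + 2 \log C_{jh} - 2 \log \tilde{C}_{jh})
 + \frac{\alpha}{nk} \tilde{C}_{jh} (1 + \log C_{jh} - \log \tilde{C}_{jh}) \Big) \\
 & + \sum_{gh} \Big( [\tilde{C} B \tilde{C}^T]_{ij} \tilde{C}_{ig} B_{gh} \tilde{C}_{jh} \frac{C_{jh}^4}{\tilde{C}_{jh}^4}
 + \frac{\alpha}{2nk} [\tilde{C}\mathbf{1}]_j \tilde{C}_{jh} \Big( \frac{C_{jh}^4}{\tilde{C}_{jh}^4} + 1 \Big) \Big) \Big)
\end{align*}

is an auxiliary function for

\[
F(C) = \|A - CBC^T\|^2 + \alpha\|C\mathbf{1} - \mathbf{1}\|^2. \tag{9}
\]

Proof. For convenience, we let $\beta = \frac{\alpha}{nk}$.

\begin{align*}
F(C) &= \sum_{ij} \Big( \big( A_{ij} - \sum_{gh} C_{ig} B_{gh} C_{jh} \big)^2 + \beta \sum_{g} \big( \sum_{h} C_{jh} - 1 \big)^2 \Big) \\
&\le \sum_{ij} \Big( \sum_{gh} \frac{\tilde{C}_{ig} B_{gh} \tilde{C}_{jh}}{[\tilde{C} B \tilde{C}^T]_{ij}}
 \Big( A_{ij} - \frac{[\tilde{C} B \tilde{C}^T]_{ij}}{\tilde{C}_{ig} B_{gh} \tilde{C}_{jh}} C_{ig} B_{gh} C_{jh} \Big)^2
 + \beta \sum_{gh} \frac{\tilde{C}_{jh}}{[\tilde{C}\mathbf{1}]_j} \Big( \frac{[\tilde{C}\mathbf{1}]_j}{\tilde{C}_{jh}} C_{jh} - 1 \Big)^2 \Big) \\
&= \sum_{ij} \Big( A_{ij}^2 - 2 \sum_{gh} A_{ij} C_{ig} B_{gh} C_{jh}
 + \sum_{gh} \frac{[\tilde{C} B \tilde{C}^T]_{ij}}{\tilde{C}_{ig} B_{gh} \tilde{C}_{jh}} C_{ig}^2 B_{gh}^2 C_{jh}^2
 + \beta \sum_{gh} \frac{[\tilde{C}\mathbf{1}]_j}{\tilde{C}_{jh}} C_{jh}^2 - 2\beta \sum_{gh} C_{jh} + k\beta \Big) \\
&= \sum_{ij} \Big( A_{ij}^2 + k\beta
 - 2 \sum_{gh} \Big( A_{ij} \tilde{C}_{ig} B_{gh} \tilde{C}_{jh} \frac{C_{ig} C_{jh}}{\tilde{C}_{ig} \tilde{C}_{jh}} + \beta \tilde{C}_{jh} \frac{C_{jh}}{\tilde{C}_{jh}} \Big)
 + \sum_{gh} \Big( [\tilde{C} B \tilde{C}^T]_{ij} \tilde{C}_{ig} B_{gh} \tilde{C}_{jh} \frac{C_{ig}^2 C_{jh}^2}{\tilde{C}_{ig}^2 \tilde{C}_{jh}^2}
 + \beta [\tilde{C}\mathbf{1}]_j \tilde{C}_{jh} \frac{C_{jh}^2}{\tilde{C}_{jh}^2} \Big) \Big) \\
&\le \sum_{ij} \Big( A_{ij}^2 + k\beta
 - 2 \sum_{gh} \Big( A_{ij} \tilde{C}_{ig} B_{gh} \tilde{C}_{jh} (1 + \log C_{ig} + \log C_{jh} - \log \tilde{C}_{ig} - \log \tilde{C}_{jh})
 + \beta \tilde{C}_{jh} (1 + \log C_{jh} - \log \tilde{C}_{jh}) \Big) \\
&\qquad + \sum_{gh} \Big( [\tilde{C} B \tilde{C}^T]_{ij} \tilde{C}_{ig} B_{gh} \tilde{C}_{jh} \cdot \frac{1}{2} \Big( \frac{C_{ig}^4}{\tilde{C}_{ig}^4} + \frac{C_{jh}^4}{\tilde{C}_{jh}^4} \Big)
 + \frac{\beta}{2} [\tilde{C}\mathbf{1}]_j \tilde{C}_{jh} \Big( \frac{C_{jh}^4}{\tilde{C}_{jh}^4} + 1 \Big) \Big) \Big) \\
&= \sum_{ij} \Big( A_{ij}^2 + k\beta
 - 2 \sum_{gh} \Big( A_{ij} \tilde{C}_{ig} B_{gh} \tilde{C}_{jh} (1 + 2 \log C_{jh} - 2 \log \tilde{C}_{jh})
 + \beta \tilde{C}_{jh} (1 + \log C_{jh} - \log \tilde{C}_{jh}) \Big) \\
&\qquad + \sum_{gh} \Big( [\tilde{C} B \tilde{C}^T]_{ij} \tilde{C}_{ig} B_{gh} \tilde{C}_{jh} \frac{C_{jh}^4}{\tilde{C}_{jh}^4}
 + \frac{\beta}{2} [\tilde{C}\mathbf{1}]_j \tilde{C}_{jh} \Big( \frac{C_{jh}^4}{\tilde{C}_{jh}^4} + 1 \Big) \Big) \Big) \\
&= G(C, \tilde{C}).
\end{align*}

During the above deduction, we use Jensen's inequality,
convexity of the quadratic function and the inequalities
$x^2 + y^2 \ge 2xy$ and $x \ge 1 + \log x$. □

The following theorem provides the updating rule for C.

Theorem 4.4. The objective function F(C) in Eq. (9) is
nonincreasing under the updating rule,

\[
C = \tilde{C} \odot \left( \frac{A \tilde{C} B + \frac{\alpha}{2}}{\tilde{C} B \tilde{C}^T \tilde{C} B + \frac{\alpha}{2} \tilde{C} E} \right)^{\frac{1}{4}} \tag{10}
\]

where $\tilde{C}$ denotes the solution from the previous iteration, E
denotes a k × k matrix of 1's, $\odot$ denotes entry-wise
product, and the division between two matrices is entry-wise
division.

Proof. Based on Lemma 4.3, take the derivative of
$G(C, \tilde{C})$ w.r.t. $C_{jh}$ to obtain

\[
\frac{\partial G(C, \tilde{C})}{\partial C_{jh}} = \sum_{i} \sum_{g} \Big(
 -4 A_{ij} \tilde{C}_{ig} B_{gh} \frac{\tilde{C}_{jh}}{C_{jh}}
 - \frac{2\alpha}{nk} \frac{\tilde{C}_{jh}}{C_{jh}}
 + 4 [\tilde{C} B \tilde{C}^T]_{ij} \tilde{C}_{ig} B_{gh} \frac{C_{jh}^3}{\tilde{C}_{jh}^3}
 + \frac{2\alpha}{nk} [\tilde{C}\mathbf{1}]_j \frac{C_{jh}^3}{\tilde{C}_{jh}^3} \Big).
\]

Solve $\frac{\partial G(C, \tilde{C})}{\partial C_{jh}} = 0$ to obtain

\[
C_{jh} = \tilde{C}_{jh} \left( \frac{\sum_{i} \sum_{g} A_{ij} \tilde{C}_{ig} B_{gh} + \frac{\alpha}{2}}
 {\sum_{i} \sum_{g} [\tilde{C} B \tilde{C}^T]_{ij} \tilde{C}_{ig} B_{gh} + \frac{\alpha}{2} [\tilde{C}\mathbf{1}]_j} \right)^{\frac{1}{4}}.
\]

Formulate the above equation into the matrix form

\[
C = \tilde{C} \odot \left( \frac{A \tilde{C} B + \frac{\alpha}{2}}{\tilde{C} B \tilde{C}^T \tilde{C} B + \frac{\alpha}{2} \tilde{C} E} \right)^{\frac{1}{4}}.
\]

By Lemma 4.2, the proof is completed. □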


Similarly,  we  present  the  following  theorems  to  derive  the 
updating  rule  for  B. 
Algorithm 1 SCC-ED algorithm

Input: A graph affinity matrix A and a positive integer k.
Output: A community membership matrix C and a community
structure matrix B.

Method:

1: Initialize B and C.
2: repeat
3:   $B = B \odot \dfrac{C^T A C}{C^T C B C^T C}$
4:   $C = C \odot \left( \dfrac{ACB + \frac{\alpha}{2}}{CBC^TCB + \frac{\alpha}{2} CE} \right)^{\frac{1}{4}}$
5: until convergence
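The two multiplicative steps of Algorithm 1 can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: the random initialization, the fixed iteration count used in place of a convergence test, the small `eps` guard in the denominators, and the names `scc_ed` and `scc_ed_objective` are all choices made here.

```python
import numpy as np


def scc_ed_objective(A, B, C, alpha):
    """F = ||A - C B C^T||^2 + alpha * ||C1 - 1||^2."""
    fit = np.linalg.norm(A - C @ B @ C.T, "fro") ** 2
    penalty = alpha * np.sum((C.sum(axis=1) - 1.0) ** 2)
    return fit + penalty


def scc_ed(A, k, alpha=1.0, n_iter=100, eps=1e-12, seed=0):
    """Alternate the B and C updates of Algorithm 1 for n_iter rounds."""
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    C = rng.random((n, k)) + 0.1
    C /= C.sum(axis=1, keepdims=True)      # start with rows summing to 1
    B = rng.random((k, k)) + 0.1
    B = (B + B.T) / 2                      # keep the prototype matrix symmetric
    E = np.ones((k, k))
    history = []
    for _ in range(n_iter):
        # Step 3: B <- B * (C^T A C) / (C^T C B C^T C), entry-wise.
        CtC = C.T @ C
        B = B * (C.T @ A @ C) / (CtC @ B @ CtC + eps)
        # Step 4: C <- C * ((A C B + a/2) / (C B C^T C B + a/2 C E))^(1/4).
        CB = C @ B
        num = A @ CB + alpha / 2
        den = CB @ (C.T @ CB) + (alpha / 2) * (C @ E) + eps
        C = C * (num / den) ** 0.25
        history.append(scc_ed_objective(A, B, C, alpha))
    return C, B, history


# Usage on a small planted three-block graph (made-up test data).
rng = np.random.default_rng(1)
labels = np.repeat(np.arange(3), 10)
P = np.full((3, 3), 0.1) + 0.8 * np.eye(3)
upper = np.triu(rng.random((30, 30)) < P[labels][:, labels], k=1)
A = (upper | upper.T).astype(float)
C, B, history = scc_ed(A, k=3)
print(history[0], history[-1])
```

Entries of B that start at zero stay at zero under the multiplicative update, so structural constraints such as a zero diagonal can be imposed through the initialization alone.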


Lemma 4.5.

\[
G(B, \tilde{B}) = \sum_{ij} \Big( A_{ij}^2 - 2 \sum_{gh} A_{ij} C_{ig} B_{gh} C_{jh}
 + \sum_{gh} [C \tilde{B} C^T]_{ij} C_{ig} C_{jh} \frac{B_{gh}^2}{\tilde{B}_{gh}} \Big)
\]

is an auxiliary function for

\[
F(B) = \|A - CBC^T\|^2. \tag{11}
\]

Theorem 4.6. The objective function F(B) in Eq. (11) is
nonincreasing under the updating rule

\[
B = \tilde{B} \odot \frac{C^T A C}{C^T C \tilde{B} C^T C}. \tag{12}
\]

Following the way to prove Lemma 4.3 and Theorem 4.4, it
is not difficult to prove the above theorems. We omit details
here.

We call the algorithm the SCC-ED algorithm, which is
summarized in Algorithm 1. The implementation of SCC-ED
is simple and it is easy to take advantage of distributed
computation for a very large data set. The complexity
of the algorithm is $O(tn^2k)$ for t iterations and it can be
further reduced for sparse data. The convergence of the
SCC-ED algorithm is guaranteed by Theorems 4.4 and 4.6.

If the task is to learn the dense clusters from similarity-based
relational data as graph partitioning does, SCC-ED
can achieve this task simply by fixing B as the identity
matrix and updating only C by (10) until convergence. In
other words, updating rule (10) itself provides a new and
efficient graph partitioning algorithm, which is computationally
more efficient than the popular spectral graph partitioning
approaches, which involve expensive eigenvector computation
(typically $O(n^3)$) and extra post-processing
(Yu & Shi, 2003) on eigenvectors to obtain the clustering.
Compared with the multi-level approaches such as METIS
(Karypis & Kumar, 1998), this new algorithm does not
restrict clusters to have an equal size.

Another advantage of the SCC-ED algorithm is that it is
very easy for the algorithm to incorporate constraints on B
to learn a specific type of cluster structure. For example,
if the task is to learn the sparse clusters by constraining
the diagonal elements of B to be zero, we can enforce this
constraint simply by initializing the diagonal elements of
B as zeros. Then, the algorithm automatically only updates
the off-diagonal elements of B and the diagonal elements
of B are "locked" to zeros.

Yet another interesting observation about SCC-ED is that
if we set $\alpha = 0$ to change the updating rule for C into the
following,

\[
C = \tilde{C} \odot \left( \frac{A \tilde{C} B}{\tilde{C} B \tilde{C}^T \tilde{C} B} \right)^{\frac{1}{4}}, \tag{13}
\]

the algorithm actually provides the symmetric conic coding.
This has been touched on in the literature as the symmetric
case of non-negative factorization (Catral et al., 2004;
Ding et al., 2005; Long et al., 2005). Therefore, SCC-ED
under $\alpha = 0$ also provides a theoretically sound solution to
the symmetric nonnegative matrix factorization.

4.2. Algorithm for SCC under Generalized I-divergence

Under the generalized I-divergence, the SCC objective
function is given as follows,

\[
D(A \| CBC^T) = \sum_{ij} \Big( A_{ij} \log \frac{A_{ij}}{[CBC^T]_{ij}} - A_{ij} + [CBC^T]_{ij} \Big). \tag{14}
\]
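The generalized I-divergence objective of Section 4.2 can be evaluated with a few lines of NumPy. The sketch below is illustrative only: it assumes the usual convention $0 \log 0 = 0$ for zero relations, and the `eps` guard against division by zero and the function name are choices made here.

```python
import numpy as np


def generalized_i_divergence(A, X, eps=1e-12):
    """D(A || X) = sum_ij (A_ij log(A_ij / X_ij) - A_ij + X_ij)."""
    A = np.asarray(A, dtype=float)
    X = np.asarray(X, dtype=float)
    # Entries with A_ij = 0 contribute only X_ij (0 * log 0 = 0).
    log_term = np.where(A > 0, A * np.log((A + eps) / (X + eps)), 0.0)
    return float(np.sum(log_term - A + X))


# Toy relation matrix and a flat approximation of it.
A = np.array([[0.0, 1.0],
              [1.0, 0.0]])
X = np.full((2, 2), 0.5)
print(generalized_i_divergence(A, A))   # zero when the approximation is exact
print(generalized_i_divergence(A, X))   # positive otherwise
```

In the SCC setting, X would be the reconstruction $CBC^T$ for the current factors.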


Similarly,  we  derive  an  alternative  bound  optimization  al¬ 
gorithm  for  this  objective  function.  First,  we  derive  the 
updating  rule  for  C  and  our  task  is  the  following  optimiza¬ 
tion. 

min  D(A\\CBCT)+a\\Cl-l\\2.  (15) 

ceR”xfc 


Then,  the  following  theorems  provide  the  updating  rule  for 

C. 

Lemma  4.7. 


G(C,  C)  =  V  (Aij  log  Aij  -  Aij  +  - 

z ^  n 


+A  ij  Yf 


CigBghC jh  ,  CigCjh 


gh 


[CBCT] 


log 


[CBCT] 


Y((CigBghCjh  +  ^(Cl  \jCjh)C^) 


gh 

-2£«a 


gh 


CigBghCjh 

[CBCT]ij  nk  3  ’  3  1 


24 


Relational  Clustering  by  Symmetric  Convex  Coding 


gh 


is  an  auxiliary  function  for 

F(C)  =  D(A\\CBCt)  +  a||CT  —  1||2.  (16) 

Theorem  4.8.  The  objective  function  F(C )  in  Eq.(  1 6)  is 
nonincreasing  under  the  updating  rule. 


Table  1 .  Summary  of  the  synthetic  relational  data 


Graph 

Parameter 

n 

k 

synl 

0.5  0  0 

0  0.5  0 

0  0  0.5 

900 

3 

syn2 

1  —  synl 

900 

3 

syn3 

0  0.1  0.1 

0.1  0  0.2 

0.1  0.2  0 

900 

3 

syn4 

[0,1]1UX1U 

5000 

10 

Cjh  —  Cjh( 


E, 


Atj  [CB]jh 
[CBCT]ij 


Ei[CB]ih  +  a[Cl] 


(17) 


where  C  denotes  the  solution  from  the  previous  iteration. 

The  following  theorems  provide  the  updating  rule  for  B. 

Lemma  4.9. 


G(B,  B)  —  f  ](Ajj  log  Ai:j  —  Aij  +  f  CigBghCjh 

ij  gh 


gh 


Cig  Bg  h  Cj  h 
[CBCTfg 


(log  CigBghCjh 


-log 


CigBghCjh 

[CBCT]ij 


))) 


is  an  auxiliary  function  for 

F(B)  =  D(A\\CBCt).  (18) 

Theorem 4.10. The objective function F(B) in Eq. (18) is
nonincreasing under the updating rule

$$B_{gh} = \tilde{B}_{gh} \, \frac{\sum_{ij} \frac{A_{ij} C_{ig} C_{jh}}{[C\tilde{B}C^T]_{ij}}}{\sum_{ij} C_{ig} C_{jh}} \qquad (19)$$

where $\tilde{B}$ denotes the solution from the previous iteration.

Due to the space limit, we omit the proofs of the above
theorems. We call the algorithm based on updating rules
(17) and (19) SCC-GI, which provides another new relational
clustering algorithm. Similarly, when applied to
similarity-based relational data of dense clusters, SCC-GI
provides another new and efficient graph partitioning
algorithm.


5. Experimental Results

This section provides empirical evidence of the effectiveness
of the SCC model and algorithms in comparison with two
representative graph partitioning algorithms: a spectral
approach, Normalized Cut (NC) (Shi & Malik, 2000), and a
multilevel algorithm, METIS (Karypis & Kumar, 1998).

5.1. Data Sets and Parameter Setting

The data sets used in the experiments include synthetic data
sets with various cluster structures and real data sets based
on various text data from the 20-newsgroups (Lang, 1995),
WebACE and TREC (Karypis, 2002).

First, we use synthetic binary relational data to simulate
relational data with different types of clusters such as dense
clusters, sparse clusters and mixed clusters. All the synthetic
relational data are generated from Bernoulli distributions.
The distribution parameters used to generate the graphs are
listed in the second column of Table 1 as matrices (the true
prototype matrices for the data). In a parameter matrix P,
$P_{ij}$ denotes the probability that the nodes in the ith
cluster are connected to the nodes in the jth cluster. For
example, in data syn3, the nodes in cluster 2 are connected
to the nodes in cluster 3 with probability 0.2 and the nodes
within a cluster are connected to each other with probability
0. syn2 is generated by using 1 minus syn1. Hence, syn1
and syn2 can be viewed as a pair of similarity/dissimilarity
data. Data syn4 has ten clusters, mixing dense clusters and
sparse clusters. Due to the space limit, its distribution
parameters are omitted here. In total, syn4 has 5000 nodes
and about 2.1 million edges.
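The synthetic graphs of Section 5.1 are drawn from a Bernoulli block model, which can be sketched as follows. This is only an illustration under stated assumptions: the graph is taken to be undirected with no self-loops, and the matrix `P_DEMO` is a hypothetical stand-in in which only the within-cluster probability 0 and the cluster 2 to cluster 3 probability 0.2 come from the text; the remaining entries are placeholders.

```python
import random

def sample_block_graph(P, sizes, seed=0):
    """Draw a symmetric 0/1 relation matrix from a Bernoulli block model.

    P[a][b] is the probability that a node of cluster a is connected to
    a node of cluster b; sizes[a] is the number of nodes in cluster a.
    """
    rng = random.Random(seed)
    labels = [c for c, size in enumerate(sizes) for _ in range(size)]
    n = len(labels)
    A = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):  # assumption: undirected, no self-loops
            if rng.random() < P[labels[i]][labels[j]]:
                A[i][j] = A[j][i] = 1
    return A, labels

def complement(A):
    """Entrywise 1 - A, in the spirit of deriving syn2 from syn1.

    Note that this also flips the diagonal; how the diagonal was treated
    in the original experiments is not stated.
    """
    return [[1 - a for a in row] for row in A]

# Hypothetical parameter matrix in the spirit of syn3 (sparse clusters).
P_DEMO = [[0.0, 0.1, 0.1],
          [0.1, 0.0, 0.2],
          [0.1, 0.2, 0.0]]
```

A call such as `sample_block_graph(P_DEMO, [50, 50, 50])` returns the relation matrix together with the true cluster labels, which are what the NMI score is later computed against.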


Graphs based on text data have been widely used to test
graph partitioning algorithms (Ding et al., 2001; Dhillon,
2001; Zha et al., 2001). Note that there also exist
feature-based algorithms that cluster documents directly on
word features. However, in this study our focus is clustering
based on relations instead of features. Hence, graph
clustering algorithms are used for comparison. We use
various data sets from the 20-newsgroups (Lang, 1995),
WebACE and TREC (Karypis, 2002), which cover data sets of
different sizes, different balances and different levels of
difficulty. We construct relational data for each text data set
such that objects (documents) are related to each other by
the cosine similarities between their term-frequency vectors.
A summary of all the data sets used to construct relational
data in this paper is shown in Table 2, in which n denotes
the number of objects in the relational data, k denotes the
number of true clusters, and balance denotes the size ratio
of the smallest cluster to the largest cluster.
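The construction of the text-based relational data and the balance statistic can be sketched as below. This is a minimal illustration, assuming dense term-frequency vectors stored as Python lists; the function names are ours, and setting the similarity to 0 for an all-zero vector is a convention chosen for the sketch.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity of two term-frequency vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention for empty documents (assumption)
    return dot / (norm_u * norm_v)

def cosine_affinity(tf_vectors):
    """Relation matrix A with A[i][j] = cosine(doc_i, doc_j)."""
    n = len(tf_vectors)
    return [[cosine(tf_vectors[i], tf_vectors[j]) for j in range(n)]
            for i in range(n)]

def balance(labels):
    """Size ratio of the smallest cluster to the largest cluster."""
    sizes = Counter(labels).values()
    return min(sizes) / max(sizes)
```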

Table 2. Summary of relational data based on text data sets.

Name       n      k   Balance  Source
tr11       414    9   0.046    TREC
tr23       204    6   0.066    TREC
NG17-19    1600   3   0.5      20-newsgroups
NG1-20     14000  20  1.0      20-newsgroups
k1b        2340   6   0.043    WebACE
hitech     2301   6   0.192    TREC
classic3   3893   3   0.708    MEDLINE/CISI/CRANFIELD

For the number of clusters k, we simply use the number of
true clusters. Note that choosing the optimal number of
clusters is a nontrivial model selection problem and beyond
the scope of this paper. As the performance measure, we use
the Normalized Mutual Information (NMI) (Strehl & Ghosh,
2002) between the resulting cluster labels and the true
cluster labels, which is a standard way to measure cluster
quality. The final performance score is the average of ten
runs.
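For reference, the NMI score can be computed as in the following sketch, which uses the normalization of Strehl & Ghosh: mutual information divided by the geometric mean of the two label entropies. Returning 0 when either labeling has zero entropy is a convention assumed here, not something the paper specifies.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (natural log) of a label assignment."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def mutual_information(a, b):
    """Mutual information between two label assignments of equal length."""
    n = len(a)
    joint = Counter(zip(a, b))
    count_a, count_b = Counter(a), Counter(b)
    return sum((n_xy / n) * math.log(n_xy * n / (count_a[x] * count_b[y]))
               for (x, y), n_xy in joint.items())

def nmi(a, b):
    """NMI = I(a; b) / sqrt(H(a) H(b))."""
    h_a, h_b = entropy(a), entropy(b)
    if h_a == 0 or h_b == 0:
        return 0.0  # degenerate case: a single cluster (assumed convention)
    return mutual_information(a, b) / math.sqrt(h_a * h_b)
```

The score is invariant to a relabeling of the clusters, so the predicted labels do not have to be matched to the true labels before evaluation.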

5.2.  Results  and  Discussion 

Table 3 shows the NMI scores of the four algorithms on
synthetic and real relational data. Each NMI score is the
average of ten test runs, and the standard deviation is also
reported. We observe that although there is no single winner
on all the data, for most data the SCC algorithms perform
better than or close to NC and METIS. In particular, SCC-GI
provides the best performance on eight of the eleven data
sets.

For the synthetic data syn1, almost all the algorithms
provide a perfect NMI score, since the data are generated
with very clear dense cluster structures, as can be seen from
the parameter matrix in Table 1. For data syn2, the
dissimilarity version of syn1, we use exactly the same set of
true cluster labels as for syn1 to measure the cluster quality.
The SCC algorithms still provide almost perfect NMI scores;
however, METIS totally fails on syn2, since in syn2 the
clusters are sparse and METIS, being based on the edge-cut
objective, looks only for dense clusters. An interesting
observation is that the NC algorithm does not totally fail on
syn2 and in fact provides a satisfactory NMI score. This is
because, although the original objective of the NC algorithm
focuses on dense clusters (its objective function can be
formulated as the trace maximization in Eq. (3)), after
relaxing C to an arbitrary orthonormal matrix, what NC
actually does is to embed cluster structures into the
eigen-space and to discover them by post-processing the
eigenvectors. Besides dense cluster structures, sparse cluster
structures can also have a good embedding in the
eigen-space under certain conditions.

In data syn3, the relations within clusters are sparser than
the relations between clusters, i.e., it also has sparse
clusters, but the structure is more subtle than in syn2. We
observe that NC does not provide satisfactory performance
and METIS totally fails; meanwhile, the SCC algorithms
identify the cluster structure in syn3 very well. Data syn4 is
a large relational data set of ten clusters consisting of four
dense clusters and six sparse clusters; we observe that the
SCC algorithms perform significantly better than NC and
METIS on it, since they can identify both dense clusters and
sparse clusters at the same time.

For the real data based on the text data sets, our task is
to find dense clusters, which is consistent with the
objectives of graph partitioning approaches. Overall, the SCC
algorithms perform better than NC and METIS on the real
data sets. In particular, SCC-GI provides the best
performance on most data sets. The possible reasons for this
are as follows. First, the SCC model makes use of any
possible block pattern in the relation matrices, whereas the
edge-cut based approaches focus on diagonal block patterns.
Hence, the SCC model is more robust to heavily overlapping
cluster structures. For example, on the difficult NG17-19
data set, the SCC algorithms do not totally fail as NC and
METIS do. Second, since the edge weights from different
graphs may have very different probabilistic distributions,
the popular Euclidean distance function, which corresponds
to a normal distribution assumption, is not always
appropriate. By Theorem 3.2, edge-cut based algorithms are
based on Euclidean distance. On the other hand, SCC-GI is
based on the generalized I-divergence, which corresponds to
a Poisson distribution assumption and is more appropriate
for graphs based on text data. Note that choosing distance
functions for specific graphs is non-trivial and beyond the
scope of this paper. Third, unlike METIS, the SCC algorithms
do not restrict clusters to have equal size and hence are
more robust to unbalanced clusters.

In the experiments, we observe that the SCC algorithms
perform stably and rarely provide unreasonable solutions,
although, like other algorithms, they provide only local
optima to the NP-hard clustering problem. We also observe
that the order of the actual running times of the algorithms
is consistent with the theoretical analysis in Section 4.1,
i.e., METIS < SCC < NC. For example, in a test run on
NG1-20, METIS, SCC-ED, SCC-GI and NC take 8.96, 11.4,
12.1 and 35.8 seconds, respectively. METIS is the fastest,
since it is quasi-linear.

We also run the SCC-ED algorithm on the actor/actress
graph based on the IMDB movie data set as a case study of
social network analysis. We formulate a graph of 20000
nodes, in which each node represents an actor/actress and
the edges denote collaboration between them. The number
of clusters is set to 200. Although there is no ground truth
for the clusters, we observe that the results consist of a
large number of interesting and meaningful clusters, such as
clusters of actors with a similar style and tight clusters of
the actors from a movie or a movie series. For example,
Table 4 shows cluster 121, consisting of 21 actors/actresses,
which contains the actors/actresses in the movie series
"The Lord of the Rings".
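A collaboration graph of this kind can be assembled from cast lists roughly as follows. The sketch is only illustrative: the input format (a mapping from a movie identifier to its cast) and the choice of weighting an edge by the number of shared movies are assumptions, since the paper only states that edges denote collaboration.

```python
from itertools import combinations

def collaboration_graph(casts):
    """Build a symmetric actor-actor relation matrix from cast lists.

    casts maps a movie identifier to a list of actor names; the entry
    A[i][j] counts the movies in which actors i and j both appear.
    """
    actors = sorted({name for cast in casts.values() for name in cast})
    index = {name: i for i, name in enumerate(actors)}
    n = len(actors)
    A = [[0] * n for _ in range(n)]
    for cast in casts.values():
        for a, b in combinations(sorted(set(cast)), 2):
            A[index[a]][index[b]] += 1
            A[index[b]][index[a]] += 1
    return actors, A
```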

Table 3. NMI comparisons of NC, METIS, SCC-ED and SCC-GI algorithms.

Data      NC              METIS          SCC-ED          SCC-GI
syn1      0.9652 ± 0.031  1.000 ± 0.000  1.000 ± 0.000   1.000 ± 0.000
syn2      0.8062 ± 0.52   0.000 ± 0.00   0.9038 ± 0.045  0.9753 ± 0.011
syn3      0.636 ± 0.152   0.115 ± 0.001  0.915 ± 0.145   1.000 ± 0.000
syn4      0.611 ± 0.032   0.638 ± 0.001  0.711 ± 0.043   0.788 ± 0.041
tr11      0.629 ± 0.039   0.557 ± 0.001  0.6391 ± 0.033  0.661 ± 0.019
tr23      0.276 ± 0.023   0.138 ± 0.004  0.335 ± 0.043   0.312 ± 0.099
NG17-19   0.002 ± 0.002   0.091 ± 0.004  0.1752 ± 0.156  0.225 ± 0.045
NG1-20    0.510 ± 0.004   0.526 ± 0.001  0.5041 ± 0.156  0.519 ± 0.010
k1b       0.546 ± 0.021   0.243 ± 0.000  0.537 ± 0.023   0.591 ± 0.022
hitech    0.302 ± 0.005   0.322 ± 0.001  0.319 ± 0.012   0.319 ± 0.018
classic3  0.621 ± 0.029   0.358 ± 0.000  0.642 ± 0.043   0.822 ± 0.059

Table 4. The members of cluster 121 in the actor graph.

Cluster 121
Viggo Mortensen, Sean Bean, Miranda Otto,
Ian Holm, Brad Dourif, Cate Blanchett,
Ian McKellen, Liv Tyler, David Wenham,
Christopher Lee, John Rhys-Davies, Elijah Wood,
Bernard Hill, Sean Astin, Dominic Monaghan,
Andy Serkis, Karl Urban, Orlando Bloom,
Billy Boyd, John Noble, Sala Baker

6. Conclusions

In this paper, we propose a general model for relational
clustering based on symmetric convex coding of the relation
matrix. The proposed model is applicable to general
relational data with various types of clusters and unifies the
existing graph partitioning models. We derive iterative
bound optimization algorithms to solve the symmetric
convex coding for two important distance functions,
Euclidean distance and generalized I-divergence. The
algorithms are applicable to general relational data and at
the same time can be easily adapted to learn specific types
of cluster structures. The convergence of the algorithms is
theoretically guaranteed. Experimental evaluation shows the
effectiveness and the great potential of the proposed model
and algorithms.
ABSTRACT

Relational clustering has attracted more and more attention
due to its phenomenal impact in various important
applications which involve multi-type interrelated data
objects, such as Web mining, search marketing,
bioinformatics, citation analysis, and epidemiology. In this
paper, we propose a probabilistic model for relational
clustering, which also provides a principled framework to
unify various important clustering tasks including traditional
attributes-based clustering, semi-supervised clustering,
co-clustering and graph clustering. The proposed model
seeks to identify cluster structures for each type of data
objects and interaction patterns between different types of
objects. Under this model, we propose parametric hard and
soft relational clustering algorithms under a large number of
exponential family distributions. The algorithms are
applicable to relational data of various structures and at the
same time unify a number of state-of-the-art clustering
algorithms: co-clustering algorithms, k-partite graph
clustering, and semi-supervised clustering based on hidden
Markov random fields.

Categories and Subject Descriptors: E.4 [Coding and
Information Theory]: Data compaction and compression;
H.3.3 [Information Search and Retrieval]: Clustering;
I.5.3 [Pattern Recognition]: Clustering.

General Terms: Algorithms.

Keywords: Clustering, Relational data, Relational
clustering, Semi-supervised clustering, EM-algorithm,
Bregman divergences, Exponential families.

1.  INTRODUCTION 

Most clustering approaches in the literature focus on "flat"
data in which each data object is represented as a
fixed-length attribute vector [38]. However, many real-world
data sets are much richer in structure, involving objects of
multiple types that are related to each other, such as
documents and words in a text corpus; Web pages, search
queries and Web users in a Web search system; and shops,
customers, suppliers, shareholders and advertisement media
in a marketing system.
In general, relational data contain three types of
information: attributes for individual objects, homogeneous
relations between objects of the same type, and
heterogeneous relations between objects of different types.
For example, for a scientific publication relational data set of
papers and authors, personal information such as the
affiliation of authors constitutes attributes; the citation
relations among papers are homogeneous relations; the
authorship relations between papers and authors are
heterogeneous relations. Such data violate the classic IID
assumption in machine learning and statistics and present
huge challenges to traditional clustering approaches. An
intuitive solution is to transform relational data into flat
data and then cluster each type of objects independently.
However, this may not work well for the following reasons.

First, the transformation causes the loss of relation and
structure information [14]. Second, traditional clustering
approaches are unable to tackle influence propagation in
clustering relational data, i.e., the hidden patterns of
different types of objects can affect each other both directly
and indirectly (passing along relation chains). Third, in some
data mining applications, users are interested not only in the
hidden structure for each type of objects, but also in
interaction patterns involving multiple types of objects. For
example, in document clustering, in addition to document
clusters and word clusters, the relationship between
document clusters and word clusters is also useful
information. It is difficult to discover such interaction
patterns by clustering each type of objects individually.

Moreover, a number of important clustering problems,
which have been of intensive interest in the literature, can be
viewed as special cases of relational clustering. For example,
graph clustering (partitioning) [7, 42, 13, 6, 20, 28] can be
viewed as clustering on single-type relational data consisting
of only homogeneous relations (represented as a graph
affinity matrix); co-clustering [12, 2], which arises in
important applications such as document clustering and
micro-array data clustering, can be formulated as clustering
on bi-type relational data consisting of only heterogeneous
relations. Recently, semi-supervised clustering [46, 4], a
special type of clustering using both labeled and unlabeled
data, has attracted significant attention. In Section 5, we
show that semi-supervised clustering can be formulated as
clustering on single-type relational data consisting of
attributes and homogeneous relations.

Therefore, relational data present not only huge challenges
to traditional unsupervised clustering approaches, but also a
great need for theoretical unification of various clustering
tasks. In this paper, we propose a probabilistic model for
relational clustering, which also provides a principled
framework to unify various important clustering tasks
including traditional attributes-based clustering,
semi-supervised clustering, co-clustering and graph
clustering. The proposed model seeks to identify cluster
structures for each type of data objects and interaction
patterns between different types of objects. It is applicable
to relational data of various structures. Under this model, we
propose parametric hard and soft relational clustering
algorithms under a large number of exponential family
distributions. The algorithms are applicable to relational data
from various applications and at the same time unify a
number of state-of-the-art clustering algorithms:
co-clustering algorithms, k-partite graph clustering, Bregman
k-means, and semi-supervised clustering based on hidden
Markov random fields.

2.  RELATED  WORK 

Clustering on a special case of relational data, bi-type
relational data consisting of only heterogeneous relations,
such as word-document data, is called co-clustering or
bi-clustering. Several previous efforts related to co-clustering
are model based [22, 23]. Spectral graph partitioning has
also been applied to bi-type relational data [11, 25]. These
algorithms formulate the data matrix as a bipartite graph
and seek to find the optimal normalized cut for the graph.
Due to the nature of a bipartite graph, these algorithms
have the restriction that the clusters from different types
of objects must have one-to-one associations.
Information-theory based co-clustering has also attracted
attention in the literature. [12] proposes a co-clustering
algorithm to maximize the mutual information between the
clustered random variables subject to constraints on the
number of row and column clusters. A more generalized
co-clustering framework is presented by [2], wherein any
Bregman divergence can be used in the objective function.
Recently, co-clustering has been addressed based on matrix
factorization. [35] proposes an EM-like algorithm based on
multiplicative updating rules.

Graph clustering (partitioning) clusters homogeneous data
objects based on pairwise similarities, which can be viewed
as homogeneous relations. Graph partitioning has been
studied for decades and a number of different approaches,
such as spectral approaches [7, 42, 13] and multilevel
approaches [6, 20, 28], have been proposed. Some efforts
[17, 43, 21, 21, 1] based on stochastic block modeling also
focus on homogeneous relations.

Compared with co-clustering and
homogeneous-relation-based clustering, clustering on general
relational data, which may consist of more than two types of
data objects with various structures, has not been well
studied in the literature. Several notable efforts are
discussed as follows. [45, 19] extend the probabilistic
relational model to the clustering scenario by introducing
latent variables into the model; these models focus on using
attribute information for clustering. [18] formulates
star-structured relational data as a star-structured m-partite
graph and develops an algorithm based on semi-definite
programming to partition the graph. [34] formulates
multi-type relational data as k-partite graphs and proposes a
family of algorithms to identify the hidden structures of a
k-partite graph by constructing a relation summary network
to approximate the original k-partite graph under a broad
range of distortion measures. The above graph-based
algorithms do not consider attribute information.

Some efforts on relational clustering are based on inductive
logic programming [37, 24, 31]. Based on the idea of
mutual reinforcement clustering, [51] proposes a framework
for clustering heterogeneous Web objects and [47] presents
an approach to improve the cluster quality of interrelated
data objects through an iterative reinforcement clustering
process. However, there is no sound objective function or
theoretical proof of the effectiveness and correctness
(convergence) of mutual reinforcement clustering. Some
efforts [26, 50, 49, 5] in the literature focus on how to
measure similarities or choose cross-relational attributes.

Figure 1: Examples of the structures of relational data,
panels (a), (b) and (c).

To summarize, research on relational data clustering has
attracted substantial attention, especially for the special
cases of relational data. However, work on general relational
data clustering is still limited and preliminary.

3.  MODEL  FORMULATION 

With different compositions of the three types of
information, attributes, homogeneous relations and
heterogeneous relations, relational data can have very
different structures. Figure 1 shows three examples of the
structures of relational data. Figure 1(a) refers to simple
bi-type relational data with only heterogeneous relations,
such as word-document data. Figure 1(b) represents bi-type
data with all types of information, such as actor-movie data,
in which actors (type 1) have attributes such as gender;
actors are related to each other by collaboration in movies
(homogeneous relations); actors are related to movies
(type 2) by taking roles in movies (heterogeneous relations).
Figure 1(c) represents data consisting of companies,
customers, suppliers, shareholders and advertisement media,
in which customers (type 5) have attributes.

In this paper, we represent a relational data set as a set
of matrices. Assume that a relational data set has m different
types of data objects, $\mathcal{X}^{(1)} = \{x_i^{(1)}\}_{i=1}^{n_1}, \ldots, \mathcal{X}^{(m)} = \{x_i^{(m)}\}_{i=1}^{n_m}$,
where $n_j$ denotes the number of objects of the jth type and
$x_p^{(j)}$ denotes the name of the pth object of the jth type.
We represent the observations of the relational data as three
sets of matrices: attribute matrices
$\{F^{(j)} \in \mathbb{R}^{d_j \times n_j}\}_{j=1}^{m}$, where $d_j$
denotes the dimension of attributes for the jth type objects
and $F^{(j)}_{\cdot p}$ denotes the attribute vector for object
$x_p^{(j)}$; homogeneous relation matrices
$\{S^{(j)} \in \mathbb{R}^{n_j \times n_j}\}_{j=1}^{m}$, where
$S^{(j)}_{pq}$ denotes the relation between $x_p^{(j)}$ and
$x_q^{(j)}$; and heterogeneous relation matrices
$\{R^{(ij)} \in \mathbb{R}^{n_i \times n_j}\}_{i,j=1}^{m}$, where
$R^{(ij)}_{pq}$ denotes the relation between $x_p^{(i)}$ and
$x_q^{(j)}$. The above representation is a general
formulation. In real applications, not every type of objects
has attributes, homogeneous relations and heterogeneous
relations. For example, the relational data set in Figure 1(a)
is represented by only one heterogeneous matrix $R^{(12)}$,
and the one in Figure 1(b) is represented by three matrices,
$F^{(1)}$, $S^{(1)}$ and $R^{(12)}$. Moreover, for a specific
clustering task, we may not use all available attributes and
relations after feature or relation selection pre-processing.
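As a small illustration of this matrix representation, the sketch below holds the three matrices of a bi-type data set like the one in Figure 1(b) and checks that their shapes agree: F is d_1 x n_1, S is n_1 x n_1 and R is n_1 x n_2. The class name `BiTypeData` and the helper `shape` are hypothetical, and nested lists are used in place of a matrix library.

```python
from dataclasses import dataclass

def shape(M):
    """Return (rows, cols) of a nested-list matrix, rejecting ragged input."""
    rows = len(M)
    cols = len(M[0]) if rows else 0
    if any(len(row) != cols for row in M):
        raise ValueError("ragged matrix")
    return rows, cols

@dataclass
class BiTypeData:
    F: list  # attribute matrix, d_1 x n_1 (one column per type 1 object)
    S: list  # homogeneous relations among type 1 objects, n_1 x n_1
    R: list  # heterogeneous relations, n_1 x n_2

    def __post_init__(self):
        _, n1 = shape(self.F)
        if shape(self.S) != (n1, n1):
            raise ValueError("S must be n_1 x n_1")
        if shape(self.R)[0] != n1:
            raise ValueError("R must have n_1 rows")

    @property
    def sizes(self):
        """Return (n_1, n_2), the number of objects of each type."""
        return shape(self.F)[1], shape(self.R)[1]
```

For instance, three actors with a single binary attribute and two movies would give a 1 x 3 matrix F, a 3 x 3 matrix S and a 3 x 2 matrix R.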

APPENDIX C

Mixed membership models, which assume that each object
has mixed membership denoting its association with
classes, have been widely used in applications involving
soft classification [16], such as matching words and pictures
[39], race genetic structures [39, 48], and classifying scientific
publications [15].

In this paper, we propose a relational mixed membership
model to cluster relational data (we refer to the model as
mixed membership relational clustering or MMRC
throughout the rest of the paper).

Assume that each type of objects $\mathcal{X}^{(j)}$ has $k_j$
latent classes. We represent the membership vectors for all
the objects in $\mathcal{X}^{(j)}$ as a membership matrix
$\Lambda^{(j)} \in [0,1]^{k_j \times n_j}$ such that the sum of
the elements of each column $\Lambda^{(j)}_{\cdot p}$ is 1 and
$\Lambda^{(j)}_{\cdot p}$ denotes the membership vector for
object $x_p^{(j)}$, i.e., $\Lambda^{(j)}_{gp}$ denotes the
probability that object $x_p^{(j)}$ associates with the gth
latent class. We also write the parameters of the
distributions that generate attributes, homogeneous relations
and heterogeneous relations in matrix form. Let
$\Theta^{(j)} \in \mathbb{R}^{d_j \times k_j}$ denote the
distribution parameter matrix for generating attributes
$F^{(j)}$ such that $\Theta^{(j)}_{\cdot g}$ denotes the parameter
vector associated with the gth latent class. Similarly,
$\Gamma^{(j)} \in \mathbb{R}^{k_j \times k_j}$ denotes the
parameter matrix for generating homogeneous relations
$S^{(j)}$, and $\Upsilon^{(ij)} \in \mathbb{R}^{k_i \times k_j}$
denotes the parameter matrix for generating heterogeneous
relations $R^{(ij)}$. In summary, the parameters of the MMRC
model are

$$\Omega = \{\{\Lambda^{(j)}\}_{j=1}^{m}, \{\Theta^{(j)}\}_{j=1}^{m}, \{\Gamma^{(j)}\}_{j=1}^{m}, \{\Upsilon^{(ij)}\}_{i,j=1}^{m}\}.$$

In general, the meanings of the parameters $\Theta$, $\Gamma$,
and $\Upsilon$ depend on the specific distribution
assumptions. However, in Section 4.1, we show that for a
large number of exponential family distributions, these
parameters can be formulated as expectations with intuitive
interpretations.

Next, we introduce the latent variables into the model.
For each object $x_p^{(j)}$, a latent cluster indicator vector is
generated based on its membership parameter
$\Lambda^{(j)}_{\cdot p}$, which is denoted as $C^{(j)}_{\cdot p}$,
i.e., $C^{(j)} \in \{0,1\}^{k_j \times n_j}$ is a latent indicator
matrix for all the jth type objects in $\mathcal{X}^{(j)}$.

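Finally, we present the generative process of the
observations, $\{F^{(j)}\}_{j=1}^{m}$, $\{S^{(j)}\}_{j=1}^{m}$, and
$\{R^{(ij)}\}_{i,j=1}^{m}$, as follows:

1. For each object $x_p^{(j)}$

•  Sample $C^{(j)}_{\cdot p} \sim \mathrm{Multinomial}(\Lambda^{(j)}_{\cdot p}, 1)$.

2. For each object $x_p^{(j)}$

•  Sample $F^{(j)}_{\cdot p} \sim \Pr(F^{(j)}_{\cdot p} \mid \Theta^{(j)} C^{(j)}_{\cdot p})$.

3. For each pair of objects $x_p^{(j)}$ and $x_q^{(j)}$

•  Sample $S^{(j)}_{pq} \sim \Pr(S^{(j)}_{pq} \mid (C^{(j)}_{\cdot p})^T \Gamma^{(j)} C^{(j)}_{\cdot q})$.

4. For each pair of objects $x_p^{(i)}$ and $x_q^{(j)}$

•  Sample $R^{(ij)}_{pq} \sim \Pr(R^{(ij)}_{pq} \mid (C^{(i)}_{\cdot p})^T \Upsilon^{(ij)} C^{(j)}_{\cdot q})$.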

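In the above generative process, a latent indicator vector for
each object is generated from a multinomial distribution
with the membership vector as its parameter. Observations
are generated independently conditioned on the latent
indicator variables. The parameters of the conditional
distributions are formulated as products of the parameter
matrices and latent indicators, i.e.,
$\Pr(F^{(j)}_{\cdot p} \mid C^{(j)}_{\cdot p}, \Theta^{(j)}) = \Pr(F^{(j)}_{\cdot p} \mid \Theta^{(j)} C^{(j)}_{\cdot p})$,
$\Pr(S^{(j)}_{pq} \mid C^{(j)}_{\cdot p}, C^{(j)}_{\cdot q}, \Gamma^{(j)}) = \Pr(S^{(j)}_{pq} \mid (C^{(j)}_{\cdot p})^T \Gamma^{(j)} C^{(j)}_{\cdot q})$, and
$\Pr(R^{(ij)}_{pq} \mid C^{(i)}_{\cdot p}, C^{(j)}_{\cdot q}, \Upsilon^{(ij)}) = \Pr(R^{(ij)}_{pq} \mid (C^{(i)}_{\cdot p})^T \Upsilon^{(ij)} C^{(j)}_{\cdot q})$.
Under this formulation, an observation is sampled from the
distributions of its associated latent classes. For example,
if $C^{(i)}_{\cdot p}$ indicates that $x_p^{(i)}$ is with the gth latent
class and $C^{(j)}_{\cdot q}$ indicates that $x_q^{(j)}$ is with the hth
latent class, then
$(C^{(i)}_{\cdot p})^T \Upsilon^{(ij)} C^{(j)}_{\cdot q} = \Upsilon^{(ij)}_{gh}$.
Hence, we have $\Pr(R^{(ij)}_{pq} \mid \Upsilon^{(ij)}_{gh})$,
implying that the relation between $x_p^{(i)}$ and $x_q^{(j)}$ is
sampled by using the parameter $\Upsilon^{(ij)}_{gh}$.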
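The generative steps for the latent indicators and the homogeneous relations can be sketched as follows. This is an illustration under explicit assumptions: the relation distribution is taken to be Bernoulli with the selected entry of the parameter matrix as its mean, every ordered pair (p, q) is sampled independently, and the names `sample_indicator`, `bilinear` and `sample_homogeneous_relations` are ours rather than the paper's.

```python
import random

def sample_indicator(membership, rng):
    """Draw a one-hot indicator vector from Multinomial(membership, 1)."""
    k = len(membership)
    g = rng.choices(range(k), weights=membership, k=1)[0]
    return [1 if h == g else 0 for h in range(k)]

def bilinear(c_p, M, c_q):
    """Compute c_p^T M c_q; for one-hot vectors this selects one entry of M."""
    return sum(c_p[g] * M[g][h] * c_q[h]
               for g in range(len(c_p)) for h in range(len(c_q)))

def sample_homogeneous_relations(memberships, Gamma, seed=0):
    """Sample latent indicators and a 0/1 relation matrix S.

    memberships[p] is the membership vector of object p (a column of the
    membership matrix); Gamma[g][h] is the Bernoulli mean of a relation
    between an object of class g and an object of class h (assumption).
    """
    rng = random.Random(seed)
    C = [sample_indicator(m, rng) for m in memberships]
    n = len(C)
    S = [[0] * n for _ in range(n)]
    for p in range(n):
        for q in range(n):
            prob = bilinear(C[p], Gamma, C[q])
            S[p][q] = 1 if rng.random() < prob else 0
    return C, S
```

The helper `bilinear` mirrors the identity used in the text: with one-hot indicators for classes g and h, the product reduces to the single parameter entry indexed by (g, h).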

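With matrix representation, the joint probability distribution
over the observations and the latent variables can be
formulated as follows,

$$\Pr(\Psi \mid \Omega) = \prod_{j=1}^{m} \Pr(C^{(j)} \mid \Lambda^{(j)}) \prod_{j=1}^{m} \Pr(F^{(j)} \mid \Theta^{(j)} C^{(j)}) \prod_{j=1}^{m} \Pr(S^{(j)} \mid (C^{(j)})^T \Gamma^{(j)} C^{(j)}) \prod_{i=1}^{m} \prod_{j=1}^{m} \Pr(R^{(ij)} \mid (C^{(i)})^T \Upsilon^{(ij)} C^{(j)}) \qquad (1)$$

where $\Psi = \{\{C^{(j)}\}_{j=1}^{m}, \{F^{(j)}\}_{j=1}^{m}, \{S^{(j)}\}_{j=1}^{m}, \{R^{(ij)}\}_{i,j=1}^{m}\}$,
$\Pr(C^{(j)} \mid \Lambda^{(j)}) = \prod_{p=1}^{n_j} \mathrm{Multinomial}(\Lambda^{(j)}_{\cdot p}, 1)$,
$\Pr(F^{(j)} \mid \Theta^{(j)} C^{(j)}) = \prod_{p=1}^{n_j} \Pr(F^{(j)}_{\cdot p} \mid \Theta^{(j)} C^{(j)}_{\cdot p})$,
$\Pr(S^{(j)} \mid (C^{(j)})^T \Gamma^{(j)} C^{(j)}) = \prod_{p,q=1}^{n_j} \Pr(S^{(j)}_{pq} \mid (C^{(j)}_{\cdot p})^T \Gamma^{(j)} C^{(j)}_{\cdot q})$,
and similarly for $R^{(ij)}$.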

4.  ALGORITHM  DERIVATION 

In this section, based on the MMRC model, we derive
parametric soft and hard relational clustering algorithms
under a large number of exponential family distributions.

4.1 MMRC with Exponential Families

To avoid clutter, instead of general relational data, we use
relational data similar to the one in Figure 1(b), which is a
representative relational data set containing all three types
of information for relational data: attributes, homogeneous
relations and heterogeneous relations. However, the
derivation and algorithms are applicable to general relational
data.

For the relational data set in Figure 1(b), we have two
types of objects, one attribute matrix F, one homogeneous
relation matrix S and one heterogeneous relation matrix R.
Based on Eq. (1), we have the following likelihood function,

$$\mathcal{L}(\Omega \mid \Psi) = \Pr(C^{(1)} \mid \Lambda^{(1)}) \Pr(C^{(2)} \mid \Lambda^{(2)}) \Pr(F \mid \Theta C^{(1)}) \Pr(S \mid (C^{(1)})^T \Gamma C^{(1)}) \Pr(R \mid (C^{(1)})^T \Upsilon C^{(2)}) \qquad (2)$$

Our goal is to maximize the likelihood function in Eq. (2)
to estimate the unknown parameters.

For the likelihood function in Eq. (2), the specific forms of
the conditional distributions for attributes and relations
depend on the specific application. Presumably, for each
specific likelihood function, we would need to derive a
specific algorithm. However, a large number of useful
distributions, such as the normal, Poisson, and Bernoulli
distributions, belong to the exponential families, and the
distribution functions of exponential families can be written
in a general form. This property allows us to derive a general
EM algorithm for the MMRC model.

It is shown in the literature [3, 9] that there exists a
bijection between exponential families and Bregman
divergences [40]. For example, the normal distribution,
Bernoulli distribution, multinomial distribution and
exponential distribution correspond to Euclidean distance,
logistic loss, KL-divergence and Itakura-Saito distance,
respectively. Based on the bijection, an exponential family
density Pr(x) can always be formulated as the following
expression with a Bregman divergence $D_\phi$,

$$\Pr(x) = \exp(-D_\phi(x, \mu)) f_\phi(x), \qquad (3)$$

where $f_\phi(x)$ is a uniquely determined function for each
exponential probability density, and $\mu$ is the expectation
parameter. Therefore, for the MMRC model under exponential
family distributions, we have the following,

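$$\Pr(F|\Theta C^{(1)}) = \exp(-D_{\phi_1}(F, \Theta C^{(1)}))\, f_{\phi_1}(F) \tag{4}$$

$$\Pr(S|(C^{(1)})^T\Gamma C^{(1)}) = \exp(-D_{\phi_2}(S, (C^{(1)})^T\Gamma C^{(1)}))\, f_{\phi_2}(S) \tag{5}$$

$$\Pr(R|(C^{(1)})^T\Upsilon C^{(2)}) = \exp(-D_{\phi_3}(R, (C^{(1)})^T\Upsilon C^{(2)}))\, f_{\phi_3}(R) \tag{6}$$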

In the above equations, a Bregman divergence of two matrices is defined
as the sum of the Bregman divergences of each pair of elements from the
two matrices. Another advantage of this formulation is that the
parameters $\Theta$, $\Gamma$, and $\Upsilon$ are expectations with
intuitive interpretations. $\Theta$ consists of center vectors of
attributes; $\Gamma$ provides an intuitive summary of the cluster
structure within objects of the same type, since $\Gamma^{(1)}_{gh}$
gives the expected relation between the $g$th cluster and the $h$th
cluster of type 1 objects; similarly, $\Upsilon$ provides an intuitive
summary of the cluster structures between objects of different types.
In the above formulation, we use different Bregman divergences,
$D_{\phi_1}$, $D_{\phi_2}$, and $D_{\phi_3}$, for the attributes,
homogeneous relations and heterogeneous relations, since they could
have different distributions in real applications. For example,
suppose  we  have  0(1'  =  jEjl  \\  j  for  normal  distribu¬ 
tion,  r(1)  =  “  g  j  J  for  Bernoulli  distribution  ,  and 

Y(12)  =  3  ^  j  for  Poisson  distribution,  then  the  cluster 

structures  of  the  data  are  very  intuitive.  First,  the  center 
attribute  vectors  for  the  two  clusters  of  type  1  are  j 

and  [  2.1  J  i  second,  by  F^  we  know  that  the  type  1  nodes 
from different clusters are barely related and cluster 1 is
denser than cluster 2; third, by $\Upsilon^{(12)}$ we know that cluster 1
of type 1 nodes is related to cluster 2 of type 2 nodes more
strongly than to cluster 1 of type 2, and so on.
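The element-wise definition of a matrix Bregman divergence can be sketched in a few lines. The snippet below is only an illustration: the helper names (`bregman`, `matrix_divergence`) are hypothetical, and the three generating functions are standard textbook choices, with the Poisson pairing to the generalized I-divergence being an assumption added here rather than something stated in the text above.

```python
import math

def bregman(phi, dphi):
    """Build the scalar Bregman divergence D_phi(x, mu) from phi and its derivative."""
    def d(x, mu):
        return phi(x) - phi(mu) - dphi(mu) * (x - mu)
    return d

# phi(x) = x^2 gives the squared Euclidean distance (normal distribution).
squared_euclidean = bregman(lambda x: x * x, lambda x: 2.0 * x)
# phi(x) = x log x - x gives the generalized I-divergence (assumed Poisson pairing).
generalized_i_div = bregman(lambda x: x * math.log(x) - x, lambda x: math.log(x))
# phi(x) = -log x gives the Itakura-Saito distance (exponential distribution).
itakura_saito = bregman(lambda x: -math.log(x), lambda x: -1.0 / x)

def matrix_divergence(d, A, B):
    """Sum the scalar divergence over each pair of corresponding matrix entries."""
    return sum(d(a, b) for row_a, row_b in zip(A, B) for a, b in zip(row_a, row_b))

example = matrix_divergence(squared_euclidean,
                            [[1.0, 2.0], [3.0, 4.0]],
                            [[1.0, 1.0], [1.0, 1.0]])
```

Swapping the first argument of `matrix_divergence` is all that is needed to move between distribution assumptions, which mirrors how the model uses a separate divergence for attributes, homogeneous relations and heterogeneous relations.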

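Since the distributions of $C^{(1)}$ and $C^{(2)}$ are modeled as
multinomial distributions, we have the following

$$\Pr(C^{(1)}|\Lambda^{(1)}) = \prod_{p=1}^{n_1}\prod_{g=1}^{k_1} \big(\Lambda^{(1)}_{gp}\big)^{C^{(1)}_{gp}} \tag{7}$$

$$\Pr(C^{(2)}|\Lambda^{(2)}) = \prod_{q=1}^{n_2}\prod_{h=1}^{k_2} \big(\Lambda^{(2)}_{hq}\big)^{C^{(2)}_{hq}} \tag{8}$$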

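Substituting Eqs. (4), (5), (6), (7), and (8) into Eq. (2) and
performing some algebraic manipulations, we obtain the following
log-likelihood function for MMRC under exponential families,

$$\begin{aligned}
\log \mathcal{L}(\Omega|\Psi) ={}& \sum_{p=1}^{n_1}\sum_{g=1}^{k_1} C^{(1)}_{gp}\log\Lambda^{(1)}_{gp} + \sum_{q=1}^{n_2}\sum_{h=1}^{k_2} C^{(2)}_{hq}\log\Lambda^{(2)}_{hq} \\
& - D_{\phi_1}(F, \Theta C^{(1)}) - D_{\phi_2}(S, (C^{(1)})^T\Gamma C^{(1)}) \\
& - D_{\phi_3}(R, (C^{(1)})^T\Upsilon C^{(2)}) + \tau,
\end{aligned} \tag{9}$$

where $\tau = \log f_{\phi_1}(F) + \log f_{\phi_2}(S) + \log f_{\phi_3}(R)$, which is a
constant in the log-likelihood function.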
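The algebraic manipulations behind Eq. (9) amount to taking the logarithm of Eq. (2) factor by factor. As a sketch, the two kinds of factors expand as follows (the other membership and divergence terms are analogous):

```latex
\begin{aligned}
\log \Pr(C^{(1)}|\Lambda^{(1)})
  &= \log \prod_{p=1}^{n_1}\prod_{g=1}^{k_1} \big(\Lambda^{(1)}_{gp}\big)^{C^{(1)}_{gp}}
   = \sum_{p=1}^{n_1}\sum_{g=1}^{k_1} C^{(1)}_{gp} \log \Lambda^{(1)}_{gp}, \\
\log \Pr(F|\Theta C^{(1)})
  &= \log\!\big[\exp(-D_{\phi_1}(F, \Theta C^{(1)}))\, f_{\phi_1}(F)\big]
   = -D_{\phi_1}(F, \Theta C^{(1)}) + \log f_{\phi_1}(F).
\end{aligned}
```

Collecting the three $\log f$ terms, which do not depend on the parameters, gives the constant $\tau$.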

Expectation Maximization (EM) is a general approach to finding the
maximum-likelihood estimate of the parameters when the model has latent
variables. EM performs maximum likelihood estimation by iteratively
maximizing the expectation of the complete (log-)likelihood, which is
the following under the MMRC model,

$$Q(\Omega, \tilde{\Omega}) = E[\log(\mathcal{L}(\Omega|\Psi))\,|\,C^{(1)}, C^{(2)}, \tilde{\Omega}], \tag{10}$$

where $\tilde{\Omega}$ denotes the current estimate of the parameters
and $\Omega$ denotes the new parameters that we optimize to increase
$Q$. Two steps, the E-step and the M-step, are performed alternately to
maximize the objective function in Eq. (10).

4.2  Monte  Carlo  E-step 

In the E-step, based on Bayes' rule, the posterior probability of the
latent variables,

$$\Pr(C^{(1)}, C^{(2)}|F, S, R, \tilde{\Omega}) = \frac{\Pr(C^{(1)}, C^{(2)}, F, S, R|\tilde{\Omega})}{\sum_{C^{(1)}, C^{(2)}} \Pr(C^{(1)}, C^{(2)}, F, S, R|\tilde{\Omega})}, \tag{11}$$

is updated using the current estimate of the parameters. However,
conditioned on the observations, the latent variables are not
independent, i.e., there exist dependencies between the posterior
probabilities of $C^{(1)}$ and $C^{(2)}$, and between those of
$C^{(1)}_{\cdot p}$ and $C^{(1)}_{\cdot q}$. Hence, directly computing
the posterior based on Eq. (11) is prohibitively expensive.

There exist several techniques for approximating an intractable
posterior, such as Monte Carlo approaches, belief propagation, and
variational methods. We follow a Monte Carlo approach, the Gibbs
sampler, which constructs a Markov chain whose stationary distribution
is the distribution to be estimated.

It is easy to compute the posterior of a latent indicator vector while
fixing all other latent indicator vectors, i.e.,

$$\Pr(C^{(1)}_{\cdot p}|C^{(1)}_{\cdot -p}, C^{(2)}, F, S, R, \tilde{\Omega}) = \frac{\Pr(C^{(1)}, C^{(2)}, F, S, R|\tilde{\Omega})}{\sum_{C^{(1)}_{\cdot p}} \Pr(C^{(1)}, C^{(2)}, F, S, R|\tilde{\Omega})}, \tag{12}$$

where $C^{(1)}_{\cdot -p}$ denotes all the latent indicator vectors
except for $C^{(1)}_{\cdot p}$. Therefore, we present the following
Markov chain to estimate the posterior in Eq. (11).

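•  Sample $C^{(1)}_{\cdot 1}$ from distribution $\Pr(C^{(1)}_{\cdot 1}|C^{(1)}_{\cdot -1}, C^{(2)}, F, S, R, \tilde{\Omega})$;

•  ...

•  Sample $C^{(1)}_{\cdot n_1}$ from distribution $\Pr(C^{(1)}_{\cdot n_1}|C^{(1)}_{\cdot -n_1}, C^{(2)}, F, S, R, \tilde{\Omega})$;

•  Sample $C^{(2)}_{\cdot 1}$ from distribution $\Pr(C^{(2)}_{\cdot 1}|C^{(2)}_{\cdot -1}, C^{(1)}, F, S, R, \tilde{\Omega})$;

•  ...

•  Sample $C^{(2)}_{\cdot n_2}$ from distribution $\Pr(C^{(2)}_{\cdot n_2}|C^{(2)}_{\cdot -n_2}, C^{(1)}, F, S, R, \tilde{\Omega})$;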

Note that at each sampling step in the above procedure, we use the
latent indicator variables sampled from previous steps. The above
procedure iterates until the stop criterion is satisfied. It can be
shown that the above procedure is a Markov chain converging to
$\Pr(C^{(1)}, C^{(2)}|F, S, R, \tilde{\Omega})$. Assume that we keep $l$
samples for estimation; then the posterior can be obtained simply by
the empirical joint distribution of $C^{(1)}$ and $C^{(2)}$ in the $l$
samples.
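The sampling scheme above can be sketched for a deliberately small special case. The code below assumes a single object type with one homogeneous relation matrix `S`, a squared Euclidean divergence, and known parameters `Gamma` and `Lam`; all function and variable names are hypothetical, and the stopping rule is simply a fixed number of sweeps rather than any criterion from the text.

```python
import math
import random

def conditional(p, labels, S, Gamma, Lam):
    """Full conditional Pr(c_p = g | all other labels, S) for g = 0..k-1."""
    k, n = len(Gamma), len(S)
    log_w = []
    for g in range(k):
        # Only the entries of S in row p and column p depend on the label of p.
        cost = (S[p][p] - Gamma[g][g]) ** 2
        for q in range(n):
            if q != p:
                h = labels[q]
                cost += (S[p][q] - Gamma[g][h]) ** 2 + (S[q][p] - Gamma[h][g]) ** 2
        log_w.append(math.log(Lam[g][p]) - cost)
    m = max(log_w)  # subtract the maximum for numerical stability
    w = [math.exp(v - m) for v in log_w]
    z = sum(w)
    return [v / z for v in w]

def gibbs(S, Gamma, Lam, sweeps, burn_in, rng):
    """Return empirical marginals Pr(c_p = g) from the samples kept after burn-in."""
    k, n = len(Gamma), len(S)
    labels = [rng.randrange(k) for _ in range(n)]
    counts = [[0] * n for _ in range(k)]
    for t in range(sweeps):
        for p in range(n):
            probs = conditional(p, labels, S, Gamma, Lam)
            labels[p] = rng.choices(range(k), weights=probs)[0]
        if t >= burn_in:
            for p in range(n):
                counts[labels[p]][p] += 1
    kept = sweeps - burn_in
    return [[c / kept for c in row] for row in counts]

# Toy data: two groups of two objects with strong within-group relations.
S = [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 1, 1], [0, 0, 1, 1]]
Gamma = [[1.0, 0.0], [0.0, 1.0]]
Lam = [[0.5] * 4, [0.5] * 4]
posterior = gibbs(S, Gamma, Lam, sweeps=200, burn_in=50, rng=random.Random(0))
```

Each inner step resamples one indicator vector while conditioning on the most recent values of all others, which is the property the note above relies on. The sketch only tracks marginals; the joint probabilities needed for the relation updates would be estimated from the same kept samples.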

4.3  M-step 

After the E-step, we have the posterior probability of the latent
variables to evaluate the expectation of the complete log-likelihood,

$$Q(\Omega, \tilde{\Omega}) = \sum_{C^{(1)}, C^{(2)}} \log(\mathcal{L}(\Omega|\Psi))\,\Pr(C^{(1)}, C^{(2)}|F, S, R, \tilde{\Omega}). \tag{13}$$

In the M-step, we optimize the unknown parameters by

$$\Omega^* = \arg\max_{\Omega} Q(\Omega, \tilde{\Omega}). \tag{14}$$

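First, we derive the update rules for the membership parameters
$\Lambda^{(1)}$ and $\Lambda^{(2)}$. To derive the expression for each
$\Lambda^{(1)}_{hp}$, we introduce the Lagrange multiplier $\alpha$ with
the constraint $\sum_{g=1}^{k_1}\Lambda^{(1)}_{gp} = 1$, and solve the
following equation,

$$\frac{\partial}{\partial\Lambda^{(1)}_{hp}}\Big\{Q(\Omega, \tilde{\Omega}) - \alpha\Big(\sum_{g=1}^{k_1}\Lambda^{(1)}_{gp} - 1\Big)\Big\} = 0. \tag{15}$$

Substituting Eqs. (9) and (13) into Eq. (15), after some algebraic
manipulations, we have

$$\Pr(C^{(1)}_{hp} = 1|F, S, R, \tilde{\Omega}) - \alpha\Lambda^{(1)}_{hp} = 0. \tag{16}$$

Summing both sides over $h$, we obtain $\alpha = 1$, resulting in the
following update rule,

$$\Lambda^{(1)}_{hp} = \Pr(C^{(1)}_{hp} = 1|F, S, R, \tilde{\Omega}), \tag{17}$$

i.e., $\Lambda^{(1)}_{hp}$ is updated as the posterior probability that
the $p$th object is associated with the $h$th cluster. Similarly, we
have the following update rule for $\Lambda^{(2)}_{hq}$,

$$\Lambda^{(2)}_{hq} = \Pr(C^{(2)}_{hq} = 1|F, S, R, \tilde{\Omega}). \tag{18}$$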
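The step from the Lagrangian condition to Eq. (16) can be spelled out as follows. This is a sketch that writes the multiplier term with a minus sign, the convention under which Eq. (16) and $\alpha = 1$ follow directly; only the membership term of Eq. (9) depends on $\Lambda^{(1)}$.

```latex
\begin{aligned}
Q(\Omega, \tilde{\Omega})
  &= \sum_{p=1}^{n_1}\sum_{g=1}^{k_1}
     \Pr(C^{(1)}_{gp} = 1|F, S, R, \tilde{\Omega}) \log \Lambda^{(1)}_{gp}
     + (\text{terms independent of } \Lambda^{(1)}), \\
0 &= \frac{\partial}{\partial \Lambda^{(1)}_{hp}}
     \Big\{ Q(\Omega, \tilde{\Omega}) - \alpha\Big(\sum_{g=1}^{k_1} \Lambda^{(1)}_{gp} - 1\Big) \Big\}
   = \frac{\Pr(C^{(1)}_{hp} = 1|F, S, R, \tilde{\Omega})}{\Lambda^{(1)}_{hp}} - \alpha, \\
\sum_{h=1}^{k_1} \Pr(C^{(1)}_{hp} = 1|F, S, R, \tilde{\Omega})
  &= \alpha \sum_{h=1}^{k_1} \Lambda^{(1)}_{hp}
  \;\Longrightarrow\; 1 = \alpha .
\end{aligned}
```

The last line uses the fact that both the posterior probabilities and the membership parameters of object $p$ sum to one over the clusters.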

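Second, we derive the update rule for $\Theta$. Based on Eqs. (9) and
(13), optimizing $\Theta$ is equivalent to the following optimization,

$$\arg\min_{\Theta} \sum_{C^{(1)}, C^{(2)}} D_{\phi_1}(F, \Theta C^{(1)})\,\Pr(C^{(1)}, C^{(2)}|F, S, R, \tilde{\Omega}). \tag{19}$$

We reformulate the above expression as,

$$\arg\min_{\Theta} \sum_{C^{(1)}} \sum_{g=1}^{k_1} \sum_{p:\,C^{(1)}_{gp}=1} D_{\phi_1}(F_{\cdot p}, \Theta_{\cdot g})\,\Pr(C^{(1)}_{gp} = 1|F, S, R, \tilde{\Omega}). \tag{20}$$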

To solve the above optimization, we make use of an important property
of Bregman divergences presented in the following theorem.


Theorem 1. Let $X$ be a random variable taking values in
$\mathcal{X} = \{x_i\}_{i=1}^{n} \subset S \subseteq \mathbb{R}^d$
following $\nu$. Given a Bregman divergence
$D_\phi : S \times \mathrm{int}(S) \mapsto [0, \infty)$, the problem

$$\min_{s \in S} E_\nu[D_\phi(X, s)] \tag{21}$$

has a unique minimizer given by $s^* = E_\nu[X]$.


The proof of Theorem 1 is omitted (please refer to [3, 40]).
Theorem 1 states that the Bregman representative of a random variable
is always the expectation of the variable. Based on Theorem 1 and the
objective function in (20), we update $\Theta_{\cdot g}$ as follows,

$$\Theta_{\cdot g} = \frac{\sum_{p=1}^{n_1} F_{\cdot p}\,\Pr(C^{(1)}_{gp} = 1|F, S, R, \tilde{\Omega})}{\sum_{p=1}^{n_1} \Pr(C^{(1)}_{gp} = 1|F, S, R, \tilde{\Omega})}. \tag{22}$$
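Theorem 1 is easy to check numerically on a small example. The sketch below uses scalar observations with illustrative weights standing in for the posterior probabilities in Eq. (22), and tests two divergences on a coarse grid; the function names and the data are made up for the illustration.

```python
import math

def squared_euclidean(x, s):
    return (x - s) ** 2

def generalized_i_div(x, s):
    return x * math.log(x / s) - x + s

def expected_divergence(d, xs, weights, s):
    """Weighted average of d(x, s), i.e. E[D(X, s)] under normalized weights."""
    total = sum(weights)
    return sum(w * d(x, s) for x, w in zip(xs, weights)) / total

def weighted_mean(xs, weights):
    """The candidate minimizer from Theorem 1, in the form of Eq. (22)."""
    return sum(w * x for x, w in zip(xs, weights)) / sum(weights)

xs = [1.0, 2.0, 6.0]
weights = [0.2, 0.5, 0.3]  # stand-ins for posterior probabilities
s_star = weighted_mean(xs, weights)

# Compare the objective at s_star with its value on a grid of alternatives.
grid = [0.5 + 0.25 * i for i in range(30)]
is_minimizer = all(
    expected_divergence(d, xs, weights, s_star)
    <= expected_divergence(d, xs, weights, s) + 1e-12
    for d in (squared_euclidean, generalized_i_div)
    for s in grid
)
```

The same weighted mean minimizes both objectives even though the divergences differ, which is why the M-step updates do not depend on the assumed exponential family.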

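Third, we derive the update rule for $\Gamma$. Based on Eqs. (9) and
(13), we formulate optimizing $\Gamma$ as the following optimization,

$$\arg\min_{\Gamma} \sum_{C^{(1)}} \sum_{g=1}^{k_1} \sum_{h=1}^{k_1} \sum_{p,q:\,C^{(1)}_{gp}=1,\,C^{(1)}_{hq}=1} D_{\phi_2}(S_{pq}, \Gamma_{gh})\,\tilde{p}, \tag{23}$$

where $\tilde{p}$ denotes
$\Pr(C^{(1)}_{gp} = 1, C^{(1)}_{hq} = 1|F, S, R, \tilde{\Omega})$ and
$1 \le p, q \le n_1$. Based on Theorem 1, we update each $\Gamma_{gh}$
as follows,

$$\Gamma_{gh} = \frac{\sum_{p,q=1}^{n_1} S_{pq}\,\Pr(C^{(1)}_{gp} = 1, C^{(1)}_{hq} = 1|F, S, R, \tilde{\Omega})}{\sum_{p,q=1}^{n_1} \Pr(C^{(1)}_{gp} = 1, C^{(1)}_{hq} = 1|F, S, R, \tilde{\Omega})}. \tag{24}$$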


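Fourth, we derive the update rule for $\Upsilon$. Based on Eqs. (9)
and (13), we formulate optimizing $\Upsilon$ as the following
optimization,

$$\arg\min_{\Upsilon} \sum_{C^{(1)}, C^{(2)}} \sum_{g=1}^{k_1} \sum_{h=1}^{k_2} \sum_{p,q:\,C^{(1)}_{gp}=1,\,C^{(2)}_{hq}=1} D_{\phi_3}(R_{pq}, \Upsilon_{gh})\,\tilde{p}, \tag{25}$$

where $\tilde{p}$ denotes
$\Pr(C^{(1)}_{gp} = 1, C^{(2)}_{hq} = 1|F, S, R, \tilde{\Omega})$,
$1 \le p \le n_1$ and $1 \le q \le n_2$. Based on Theorem 1, we update
each $\Upsilon_{gh}$ as follows,

$$\Upsilon_{gh} = \frac{\sum_{p=1}^{n_1}\sum_{q=1}^{n_2} R_{pq}\,\Pr(C^{(1)}_{gp} = 1, C^{(2)}_{hq} = 1|F, S, R, \tilde{\Omega})}{\sum_{p=1}^{n_1}\sum_{q=1}^{n_2} \Pr(C^{(1)}_{gp} = 1, C^{(2)}_{hq} = 1|F, S, R, \tilde{\Omega})}. \tag{26}$$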

Combining the E-step and M-step, we have a general relational
clustering algorithm, the Exponential Family MMRC (EF-MMRC) algorithm,
which is summarized in Algorithm 1. Since it is straightforward to
apply our algorithm derivation to a relational data set of any
structure, Algorithm 1 is stated for a general relational data set as
input. Although the input relational data could have various
structures, EF-MMRC works simply as follows: in the E-step, EF-MMRC
iteratively updates the posterior probabilities that an object is
associated with the clusters (the Markov chain in Section 4.2); in the
M-step, based on the current cluster association (posterior
probabilities), the cluster representatives of attributes and relations
are updated as the weighted mean of the observations, no matter which
exponential family distributions are assumed.
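As an illustration of how the M-step consumes the E-step output, the sketch below estimates the posteriors from a list of kept Gibbs samples and applies update rules (17) and (24) for a single object type. The function name `soft_m_step` and the toy data are hypothetical, and labels are stored as one cluster index per object rather than as indicator matrices.

```python
def soft_m_step(samples, S, k):
    """Estimate Lambda (rule 17) and Gamma (rule 24) from sampled label vectors."""
    n, l = len(S), len(samples)
    # Marginal posterior of object p belonging to cluster g: fraction of samples.
    Lam = [[sum(1 for c in samples if c[p] == g) / l for p in range(n)]
           for g in range(k)]
    num = [[0.0] * k for _ in range(k)]
    den = [[0.0] * k for _ in range(k)]
    for c in samples:
        for p in range(n):
            for q in range(n):
                # Each sample contributes 1/l to the empirical joint posterior
                # Pr(c_p = g, c_q = h) for its own pair (g, h).
                num[c[p]][c[q]] += S[p][q] / l
                den[c[p]][c[q]] += 1.0 / l
    Gamma = [[num[g][h] / den[g][h] if den[g][h] > 0 else 0.0 for h in range(k)]
             for g in range(k)]
    return Lam, Gamma

S = [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 1, 1], [0, 0, 1, 1]]
samples = [[0, 0, 1, 1], [0, 0, 1, 1], [0, 1, 1, 1]]
Lam, Gamma = soft_m_step(samples, S, k=2)
```

The attribute and heterogeneous relation updates, rules (22) and (26), follow the same weighted-mean pattern with `F` or `R` in place of `S`.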

Therefore, with the simplicity of traditional centroid-based
clustering algorithms, EF-MMRC is capable of making use of all
attribute information and homogeneous and heterogeneous relation
information to learn hidden structures from various relational data.
Since EF-MMRC simultaneously clusters multi-type interrelated objects,
the cluster structures of different types of objects may interact with
each other directly or indirectly during the clustering process to
automatically deal with the influence propagation. Besides the local
cluster structures for each type of objects,

Algorithm 1  Exponential Family MMRC Algorithm
Input: A relational data set
  $\{\{F^{(j)}\}_{j=1}^{m}, \{S^{(j)}\}_{j=1}^{m}, \{R^{(ij)}\}_{i,j=1}^{m}\}$,
  a set of exponential family distributions (Bregman divergences)
  assumed for the data set.
Output: Membership matrices $\{\Lambda^{(j)}\}_{j=1}^{m}$, attribute
  expectation matrices $\{\Theta^{(j)}\}_{j=1}^{m}$, homogeneous relation
  expectation matrices $\{\Gamma^{(j)}\}_{j=1}^{m}$, and heterogeneous
  relation expectation matrices $\{\Upsilon^{(ij)}\}_{i,j=1}^{m}$.
Method:
 1: Initialize the parameters as $\tilde{\Omega} = \{\{\Lambda^{(j)}\}_{j=1}^{m}, \{\Theta^{(j)}\}_{j=1}^{m}, \{\Gamma^{(j)}\}_{j=1}^{m}, \{\Upsilon^{(ij)}\}_{i,j=1}^{m}\}$.
 2: repeat
 3:   {E-step}
 4:   Compute the posterior
      $\Pr(\{C^{(j)}\}_{j=1}^{m}|\{F^{(j)}\}_{j=1}^{m}, \{S^{(j)}\}_{j=1}^{m}, \{R^{(ij)}\}_{i,j=1}^{m}, \tilde{\Omega})$
      using the Gibbs sampler.
 5:   {M-step}
 6:   for j = 1 to m do
 7:     Compute $\Lambda^{(j)}$ using update rule (17).
 8:     Compute $\Theta^{(j)}$ using update rule (22).
 9:     Compute $\Gamma^{(j)}$ using update rule (24).
10:     for i = 1 to m do
11:       Compute $\Upsilon^{(ij)}$ using update rule (26).
12:     end for
13:   end for
14:   $\tilde{\Omega} = \Omega$
15: until convergence


the output of EF-MMRC also provides the summary of the global hidden
structure for the data, i.e., based on $\Gamma$ and $\Upsilon$, we know
how the clusters of the same type and different types are related to
each other. Furthermore, relational data from different applications
may have different probabilistic distributions on the attributes and
relations; it is easy for EF-MMRC to adapt to this situation by simply
using different Bregman divergences corresponding to different
exponential family distributions.

If we assume $O(m)$ types of heterogeneous relations among $m$ types
of objects, which is typical in real applications, and let
$n = \Theta(n_i)$ and $k = \Theta(k_i)$, the computational complexity of
EF-MMRC can be shown to be $O(tmn^2k)$ for $t$ iterations. If we apply
the k-means algorithm to each type of nodes individually by
transforming the relations into attributes for each type of nodes, the
total computational complexity is also $O(tmn^2k)$.

4.4  Hard  MMRC  Algorithm 

Due to its simplicity, scalability, and broad applicability, the
k-means algorithm has become one of the most popular clustering
algorithms. Hence, it is desirable to extend k-means to relational
data. Some efforts [47, 2, 12, 33] in the literature work in this
direction. However, these approaches apply to only some special and
simple cases of relational data, such as bi-type heterogeneous
relational data.

As traditional k-means can be formulated as a hard version of the
Gaussian mixture model EM algorithm [29], we propose the hard version
of the MMRC algorithm as a general relational k-means algorithm (from
now on, we refer to Algorithm 1 as soft EF-MMRC), which applies to
various relational data.

To derive the hard version of the MMRC algorithm, we omit the soft
membership parameters $\Lambda^{(j)}$ in the MMRC model ($C^{(j)}$ in
the model provides the hard membership for each object). Next, we
change the computation of the posterior probabilities in the E-step to
a reassignment procedure, i.e., in the E-step, based on the estimation
of the current parameters, we reassign cluster labels,
$\{C^{(j)}\}_{j=1}^{m}$, to maximize the objective function in (9). In
particular, for each object, while fixing the cluster assignments of
all other objects, we assign it to each cluster in turn to find the
optimal cluster assignment maximizing the objective function in (9),
which is equivalent to minimizing the Bregman distances between the
observations and the corresponding expectation parameters. After all
objects are assigned, the reassignment process is repeated until no
object changes its cluster assignment between two successive
iterations.
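The reassignment procedure can be sketched for the simplest case of one object type, one homogeneous relation matrix, and a squared Euclidean divergence. The names `assignment_cost` and `reassign` are hypothetical, and the `max_sweeps` cap is an added safeguard that is not part of the description above.

```python
def assignment_cost(p, g, labels, S, Gamma):
    """Divergence contributed by object p if it were placed in cluster g."""
    cost = (S[p][p] - Gamma[g][g]) ** 2
    for q in range(len(S)):
        if q != p:
            h = labels[q]
            cost += (S[p][q] - Gamma[g][h]) ** 2 + (S[q][p] - Gamma[h][g]) ** 2
    return cost

def reassign(labels, S, Gamma, max_sweeps=100):
    """Move each object to its best cluster until a full sweep changes nothing."""
    labels = list(labels)
    k = len(Gamma)
    for _ in range(max_sweeps):
        changed = False
        for p in range(len(S)):
            # Try every cluster for object p while all other labels stay fixed.
            best = min(range(k), key=lambda g: assignment_cost(p, g, labels, S, Gamma))
            if best != labels[p]:
                labels[p] = best
                changed = True
        if not changed:
            break
    return labels

S = [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 1, 1], [0, 0, 1, 1]]
Gamma = [[1.0, 0.0], [0.0, 1.0]]
new_labels = reassign([0, 1, 1, 1], S, Gamma)
```

Only the row and column of `S` that belong to object `p` enter the cost, because the remaining entries of the objective do not change with the label of `p`.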

In the M-step, we estimate the parameters based on the cluster
assignments from the E-step. A simple way to derive the update rules is
to follow the derivation in Section 4.3 but replace the posterior
probabilities by their hard versions. For example, after the E-step, if
the object $x^{(j)}_p$ is assigned to the $g$th cluster, i.e.,
$C^{(j)}_{gp} = 1$, then the posterior
$\Pr(C^{(j)}_{gp} = 1|F, S, R, \tilde{\Omega}) = 1$ and
$\Pr(C^{(j)}_{hp} = 1|F, S, R, \tilde{\Omega}) = 0$ for $h \neq g$.

Using the hard versions of the posterior probabilities, we derive the
following update rule for $\Theta^{(j)}_{\cdot g}$,

$$\Theta^{(j)}_{\cdot g} = \frac{\sum_{p:\,C^{(j)}_{gp}=1} F^{(j)}_{\cdot p}}{\sum_{p=1}^{n_j} C^{(j)}_{gp}}. \tag{27}$$


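In the above update rule, since $\sum_{p=1}^{n_j} C^{(j)}_{gp}$ is the
size of the $g$th cluster, $\Theta^{(j)}_{\cdot g}$ is actually updated
as the mean of the attribute vectors of the objects assigned to the
$g$th cluster. Similarly, we have the following update rule for
$\Gamma^{(j)}_{gh}$,

$$\Gamma^{(j)}_{gh} = \frac{\sum_{p,q:\,C^{(j)}_{gp}=1,\,C^{(j)}_{hq}=1} S^{(j)}_{pq}}{\sum_{p=1}^{n_j} C^{(j)}_{gp}\,\sum_{q=1}^{n_j} C^{(j)}_{hq}}, \tag{28}$$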


i.e., $\Gamma^{(j)}_{gh}$ is updated as the mean of the relations
between the objects of the $j$th type from the $g$th cluster and from
the $h$th cluster.

Each heterogeneous relation expectation parameter
$\Upsilon^{(ij)}_{gh}$ is updated as the mean of the relations between
the objects of the $i$th type from the $g$th cluster and of the $j$th
type from the $h$th cluster,

$$\Upsilon^{(ij)}_{gh} = \frac{\sum_{p,q:\,C^{(i)}_{gp}=1,\,C^{(j)}_{hq}=1} R^{(ij)}_{pq}}{\sum_{p=1}^{n_i} C^{(i)}_{gp}\,\sum_{q=1}^{n_j} C^{(j)}_{hq}}. \tag{29}$$
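Update rules (27) and (28) reduce to plain cluster means, as the following sketch shows for one object type. The function name `hard_m_step` and the toy matrices are hypothetical; labels are stored as one cluster index per object, and the code assumes that no cluster is empty.

```python
def hard_m_step(labels, F, S, k):
    """Cluster means of attributes (rule 27) and of relations (rule 28).

    F is a d x n attribute matrix whose columns are objects; S is n x n.
    Assumes every cluster has at least one object.
    """
    n, d = len(labels), len(F)
    size = [labels.count(g) for g in range(k)]
    # Theta[a][g]: mean of attribute a over the objects in cluster g.
    Theta = [[sum(F[a][p] for p in range(n) if labels[p] == g) / size[g]
              for g in range(k)]
             for a in range(d)]
    # Gamma[g][h]: mean relation between objects of cluster g and cluster h.
    Gamma = [[sum(S[p][q] for p in range(n) for q in range(n)
                  if labels[p] == g and labels[q] == h) / (size[g] * size[h])
              for h in range(k)]
             for g in range(k)]
    return Theta, Gamma

labels = [0, 0, 1, 1]
F = [[1.0, 3.0, 10.0, 12.0], [0.0, 2.0, 5.0, 7.0]]
S = [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 1, 1], [0, 0, 1, 1]]
Theta, Gamma = hard_m_step(labels, F, S, k=2)
```

Rule (29) has the same form as the `Gamma` computation, with a heterogeneous matrix `R` and two separate label vectors, one for each object type.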


The  hard  version  of  EF-MMRC  algorithm  is  summarized 
in  Algorithm  2.  It  works  simply  as  the  classic  k-means. 
However,  it  is  applicable  to  various  relational  data  under 
various  Bregman  distance  functions  corresponding  to  vari¬ 
ous  assumptions  of  probability  distributions.  Based  on  the 
EM  framework,  its  convergence  is  guaranteed.  When  ap¬ 
plied  to  some  special  cases  of  relational  data,  it  provides 
simple  and  new  algorithms  for  some  important  data  mining 
problems.  For  example,  when  applied  to  the  data  of  one 
homogeneous  relation  matrix  representing  a  graph  affinity 
matrix,  it  provides  a  simple  and  new  graph  partitioning  al¬ 
gorithm. 

Based on Algorithms 1 and 2, there is another version of EF-MMRC,
i.e., we may combine soft and hard EF-MMRC to obtain mixed EF-MMRC.
For example, we first run hard EF-MMRC several times as
initialization, then run soft EF-MMRC.

Algorithm 2  Hard MMRC Algorithm
Input: A relational data set
  $\{\{F^{(j)}\}_{j=1}^{m}, \{S^{(j)}\}_{j=1}^{m}, \{R^{(ij)}\}_{i,j=1}^{m}\}$,
  a set of exponential family distributions (Bregman divergences)
  assumed for the data set.
Output: Cluster indicator matrices $\{C^{(j)}\}_{j=1}^{m}$, attribute
  expectation matrices $\{\Theta^{(j)}\}_{j=1}^{m}$, homogeneous relation
  expectation matrices $\{\Gamma^{(j)}\}_{j=1}^{m}$, and heterogeneous
  relation expectation matrices $\{\Upsilon^{(ij)}\}_{i,j=1}^{m}$.
Method:
 1: Initialize the parameters as $\tilde{\Omega} = \{\{\Lambda^{(j)}\}_{j=1}^{m}, \{\Theta^{(j)}\}_{j=1}^{m}, \{\Gamma^{(j)}\}_{j=1}^{m}, \{\Upsilon^{(ij)}\}_{i,j=1}^{m}\}$.
 2: repeat
 3:   {E-step}
 4:   Based on the current parameters, reassign cluster labels for each
      object, i.e., update $\{C^{(j)}\}_{j=1}^{m}$ to maximize the
      objective function in Eq. (9).
 5:   {M-step}
 6:   for j = 1 to m do
 7:     Compute $\Theta^{(j)}$ using update rule (27).
 8:     Compute $\Gamma^{(j)}$ using update rule (28).
 9:     for i = 1 to m do
10:       Compute $\Upsilon^{(ij)}$ using update rule (29).
11:     end for
12:   end for
13:   $\tilde{\Omega} = \Omega$
14: until convergence


5.  A  UNIFIED  VIEW  TO  CLUSTERING 

In this section we discuss the connections between existing
clustering approaches and the MMRC model and EF-MMRC algorithms. By
considering them as special cases or variations of the MMRC model, we
show that MMRC provides a unified view of the existing clustering
approaches from various important data mining applications.

5.1  Semi-supervised  Clustering 

Recently,  semi-supervised  clustering  has  become  a  topic 
of  significant  interest  [4,  46],  which  seeks  to  cluster  a  set  of 
data  points  with  a  set  of  pairwise  constraints. 

Semi-supervised clustering can be formulated as a special case of
relational clustering, namely clustering on the single-type relational
data set consisting of attributes F and homogeneous relations S. For
semi-supervised clustering, $S_{pq}$ denotes the pairwise constraint on
the $p$th object and the $q$th object.

[4] provides a general model for semi-supervised clustering based on
Hidden Markov Random Fields (HMRFs). We show that it can be formulated
as a special case of the MMRC model. As in [4], we define the
homogeneous relation matrix S as follows,

$$S_{pq} = \begin{cases} f_M(x_p, x_q) & \text{if } (x_p, x_q) \in \mathcal{M} \\ f_C(x_p, x_q) & \text{if } (x_p, x_q) \in \mathcal{C} \\ 0 & \text{otherwise} \end{cases}$$

where $\mathcal{M}$ denotes a set of must-link constraints;
$\mathcal{C}$ denotes a set of cannot-link constraints; $f_M(x_p, x_q)$
is a function that penalizes the violation of a must-link constraint;
and $f_C(x_p, x_q)$ is a penalty function for cannot-links. We assume a
Gibbs distribution [41] for S,

$$\Pr(S) = \frac{1}{Z_1}\exp\Big(-\sum_{p,q} S_{pq}\Big), \tag{30}$$

where $Z_1$ is the normalization constant. Since [4] focuses on only
hard clustering, we omit the soft membership parameters in the MMRC
model to consider hard clustering. Based on Eq. (30) and Eq. (4), the
likelihood function of hard semi-supervised clustering under the MMRC
model is

$$\mathcal{L}(\Theta|F) = \frac{1}{Z_1}\exp\Big(-\sum_{p,q} S_{pq}\Big)\exp(-D_\phi(F, \Theta C)). \tag{31}$$

Since C is an indicator matrix, Eq. (31) can be formulated as

$$\mathcal{L}(\Theta|F) = \frac{1}{Z_1}\exp\Big(-\sum_{p,q} S_{pq}\Big)\exp\Big(-\sum_{g=1}^{k}\sum_{p:\,C_{gp}=1} D_\phi(F_{\cdot p}, \Theta_{\cdot g})\Big). \tag{32}$$

The above likelihood function is equivalent to the objective function
of semi-supervised clustering based on HMRFs [4]. Furthermore, when
applied to optimizing the objective function in Eq. (32), hard MMRC
provides a family of semi-supervised clustering algorithms similar to
HMRF-KMeans in [4]; on the other hand, soft EF-MMRC provides new, soft
versions of semi-supervised clustering algorithms.

5.2  Co-clustering 

Co-clustering, or bi-clustering, arises in many important
applications, such as document clustering and micro-array data
clustering. A number of approaches [12, 8, 33, 2] have been proposed
for co-clustering. These efforts can be generalized as solving the
following matrix approximation problem [34],

$$\arg\min_{C^{(1)}, C^{(2)}, \Upsilon} \mathcal{D}(R, (C^{(1)})^T\Upsilon C^{(2)}), \tag{33}$$

where $R \in \mathbb{R}^{n_1 \times n_2}$ is the data matrix,
$C^{(1)} \in \{0,1\}^{k_1 \times n_1}$ and
$C^{(2)} \in \{0,1\}^{k_2 \times n_2}$ are indicator matrices,
$\Upsilon \in \mathbb{R}^{k_1 \times k_2}$ is the relation
representative matrix, and $\mathcal{D}$ is a distance function. For
example, [12] uses KL-divergences as the distance function; [8, 33] use
Euclidean distances.

Co-clustering  is  equivalent  to  clustering  on  relational  data 
of  one  heterogeneous  relation  matrix  R.  Based  on  Eq.(9), 
by  omitting  the  soft  membership  parameters,  maximizing 
log-likelihood  function  of  hard  clustering  on  a  heterogeneous 
relation  matrix  under  the  MMRC  model  is  equivalent  to  the 
minimization  in  (33).  The  algorithms  proposed  in  [12,  8,  33, 
2]  can  be  viewed  as  special  cases  of  hard  EF-MMRC.  At  the 
same  time,  soft  EF-MMRC  provides  another  family  of  new 
algorithms  for  co-clustering. 

Our previous work [34] proposes the relation summary network model
for clustering k-partite graphs, which can be shown to be equivalent
to clustering on relational data of multiple heterogeneous relation
matrices. The algorithms proposed in [34] can also be viewed as special
cases of the hard EF-MMRC algorithm.

5.3  Graph  Clustering 

Graph clustering (partitioning) is an important problem in many
domains, such as circuit partitioning, VLSI design, and task
scheduling. Existing graph partitioning approaches are mainly based on
edge cut objectives, such as the Kernighan-Lin objective [30],
normalized cut [42], ratio cut [7], ratio association [42], and min-max
cut [13].

Graph clustering is equivalent to clustering on single-type relational
data of one homogeneous relation matrix S. The log-likelihood function
of hard clustering under the MMRC model is
$-D_\phi(S, (C)^T\Gamma C)$. We propose the following theorem to show
that the edge cut objectives are mathematically equivalent to a special
case of the MMRC model. Since most graph partitioning objective
functions use a weighted indicator matrix such that $CC^T = I_k$, where
$I_k$ is an identity matrix, we follow this formulation in the
following theorem.

Theorem 2. With $\Gamma$ restricted to the form $rI_k$ for $r > 0$,
maximizing the log-likelihood of hard MMRC clustering on S under normal
distribution, i.e.,

$$\max_{C \in \{0,1\}^{k \times n},\ CC^T = I_k} -\|S - (C)^T(rI_k)C\|^2, \tag{34}$$

is equivalent to the trace maximization

$$\max\ \mathrm{tr}(CSC^T), \tag{35}$$

where tr denotes the trace of a matrix.

Proof. Let L denote the objective function in Eq. (34).

$$\begin{aligned}
L &= -\|S - rC^TC\|^2 \\
  &= -\mathrm{tr}((S - rC^TC)^T(S - rC^TC)) \\
  &= -\mathrm{tr}(S^TS) + 2r\,\mathrm{tr}(C^TCS) - r^2\,\mathrm{tr}(C^TCC^TC) \\
  &= -\mathrm{tr}(S^TS) + 2r\,\mathrm{tr}(CSC^T) - r^2k
\end{aligned}$$

The above deduction uses the trace property
$\mathrm{tr}(XY) = \mathrm{tr}(YX)$. Since $\mathrm{tr}(S^TS)$, $r$ and
$k$ are constants, the maximization of L is equivalent to the
maximization of $\mathrm{tr}(CSC^T)$. The proof is completed.  □
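The identity used in the proof can be verified numerically. The sketch below builds a weighted indicator matrix whose rows are scaled by the inverse square root of the cluster size, so that $CC^T = I_k$, and compares both sides of the derivation; the helper names, the matrix `S`, and the value of `r` are all made up for the illustration.

```python
import math

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def trace(A):
    return sum(A[i][i] for i in range(len(A)))

def weighted_indicator(labels, k):
    """k x n matrix with entries 1/sqrt(cluster size), so that C C^T = I_k."""
    sizes = [labels.count(g) for g in range(k)]
    return [[1.0 / math.sqrt(sizes[g]) if c == g else 0.0 for c in labels]
            for g in range(k)]

def objective(S, C, r):
    """L = -||S - r C^T C||^2 with the Frobenius norm."""
    approx = matmul(transpose(C), C)
    return -sum((s - r * a) ** 2 for rs, ra in zip(S, approx) for s, a in zip(rs, ra))

def trace_form(S, C, r):
    """-tr(S^T S) + 2 r tr(C S C^T) - r^2 k, the last line of the proof."""
    k = len(C)
    csct = matmul(matmul(C, S), transpose(C))
    return -trace(matmul(transpose(S), S)) + 2 * r * trace(csct) - r * r * k

S = [[0, 3, 1, 0], [3, 0, 0, 1], [1, 0, 0, 2], [0, 1, 2, 0]]
C_a = weighted_indicator([0, 0, 1, 1], 2)  # groups the heavy edges together
C_b = weighted_indicator([0, 1, 0, 1], 2)  # cuts the heavy edges
r = 0.7
```

For this toy graph the partition with the larger trace is also the one with the larger value of `objective`, as the theorem predicts for any fixed `r > 0`.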

Since it is shown in the literature [10] that the edge cut objectives
can be formulated as the trace maximization, Theorem 2 states that
edge-cut based graph clustering is equivalent to the MMRC model under
normal distribution with the diagonal constraint on the parameter
matrix $\Gamma$. This connection provides not only a new understanding
of graph partitioning but also a family of new algorithms (soft and
hard MMRC algorithms) for graph clustering.

Finally, we point out that the MMRC model does not exclude traditional
attribute-based clustering. When applied to an attribute data matrix
under Euclidean distances, the hard MMRC algorithm reduces to the
classic k-means; the soft MMRC algorithm is very close to the
traditional mixture model EM clustering except that it does not involve
mixing proportions in the computation.

In summary, the MMRC model provides a principled framework to unify
various important clustering tasks including traditional
attribute-based clustering, semi-supervised clustering, co-clustering
and graph clustering; soft and hard EF-MMRC algorithms unify a number
of state-of-the-art clustering algorithms and at the same time provide
new solutions to various clustering tasks.

6.  EXPERIMENTS 

This section provides empirical evidence to show the effectiveness of
the MMRC model and algorithms. Since a number of state-of-the-art
clustering algorithms [12, 8, 33, 2, 3, 4] can be viewed as special
cases of the EF-MMRC model and algorithms, the experimental results in
these efforts also illustrate the effectiveness of the MMRC model and
algorithms. In this paper, we apply MMRC algorithms to tasks of graph
clustering, bi-clustering, tri-clustering, and clustering on a general
relational data set of all three types of information. In the
experiments, we use the mixed version of MMRC, i.e., hard MMRC
initialization followed by soft MMRC. Although MMRC can adopt various
distribution assumptions, due to space limits we use MMRC under the
normal or Poisson distribution assumption in the experiments. However,
this


Table 1: Summary of relational data for Graph Clustering.

Name     n      k    Balance  Source
tr11     414    9    0.046    TREC
tr23     204    6    0.066    TREC
NG1-20   14000  20   1.0      20-newsgroups
k1b      2340   6    0.043    WebACE

Figure 2: NMI comparison of SGP, METIS and MMRC algorithms.
(Vertical axis ticks run from 0.2 to 0.7.)

does  not  imply  that  they  are  optimal  distribution  assump¬ 
tions  for  the  data.  How  to  decide  the  optimal  distribution 
assumption  is  beyond  the  scope  of  this  paper. 

For  performance  measure,  we  elect  to  use  the  Normalized 
Mutual  Information  (NMI)  [44]  between  the  resulting  cluster 
labels  and  the  true  cluster  labels,  which  is  a  standard  way 
to  measure  the  cluster  quality.  The  final  performance  score 
is  the  average  of  ten  runs. 
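For readers who want to reproduce the measure, a minimal NMI computation is sketched below. It normalizes the mutual information by the geometric mean of the two label entropies, which is one common definition and is assumed here rather than taken from the text; the function name and the zero-entropy convention are also choices made for this illustration.

```python
import math
from collections import Counter

def nmi(labels_a, labels_b):
    """Normalized mutual information between two labelings of the same objects."""
    n = len(labels_a)
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    mutual_info = sum((c / n) * math.log(c * n / (count_a[a] * count_b[b]))
                      for (a, b), c in joint.items())
    entropy_a = -sum((c / n) * math.log(c / n) for c in count_a.values())
    entropy_b = -sum((c / n) * math.log(c / n) for c in count_b.values())
    if entropy_a == 0.0 or entropy_b == 0.0:
        return 0.0  # convention chosen here for a labeling with a single cluster
    return mutual_info / math.sqrt(entropy_a * entropy_b)

same_partition = nmi([0, 0, 1, 1], [1, 1, 0, 0])  # identical up to label names
independent = nmi([0, 0, 1, 1], [0, 1, 0, 1])     # labels carry no shared information
```

Because the score depends only on the co-occurrence counts, it is invariant to how the cluster labels are named, which is why it can compare a clustering with ground-truth classes directly.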

6.1  Graph  Clustering 

In  this  section,  we  present  experiments  on  the  MMRC 
algorithm  under  normal  distribution  in  comparison  with  two 
representative  graph  partitioning  algorithms,  the  spectral 
graph  partitioning  (SGP)  from  [36]  that  is  generalized  to 
work  with  both  normalized  cut  and  ratio  association,  and 
the  classic  multilevel  algorithm,  METIS  [28]. 

Graphs based on text data have been widely used to test graph
partitioning algorithms [13, 11, 25]. In this study, we use various
data sets from the 20-newsgroups [32], WebACE and TREC [27]
collections, which cover data sets of different sizes, different
balances and different levels of difficulty. The data are pre-processed
by removing the stop words, and each document is represented by a
term-frequency vector using TF-IDF weights. Then we construct
relational data for each text data set such that objects (documents)
are related to each other with cosine similarities between the
term-frequency vectors. A summary of all the data sets used to
construct relational data in this paper is shown in Table 1, in which n
denotes the number of objects in the relational data, k denotes the
number of true clusters, and balance denotes the size ratio of the
smallest clusters to the largest clusters.
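The construction of the relation matrix from documents can be sketched as follows. The TF-IDF variant used here (raw term count times log(n/df)) is only one of several common weightings and is an assumption, as are the function names and the toy documents; stop-word removal is left out.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Sparse TF-IDF vectors, one dict per tokenized document."""
    n = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        term_freq = Counter(doc)
        vectors.append({t: c * math.log(n / doc_freq[t]) for t, c in term_freq.items()})
    return vectors

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u > 0 and norm_v > 0 else 0.0

def relation_matrix(docs):
    """Homogeneous relation matrix S with S[p][q] = cosine similarity of docs p, q."""
    vecs = tfidf_vectors(docs)
    return [[cosine(u, v) for v in vecs] for u in vecs]

docs = [["cat", "dog"], ["cat", "dog", "dog"], ["stock", "market"]]
S = relation_matrix(docs)
```

The resulting `S` is symmetric with a unit diagonal for non-empty documents and can be passed directly to a graph clustering routine as the affinity matrix.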

For  the  number  of  clusters  k,  we  simply  use  the  number 
of  the  true  clusters.  Note  that  how  to  choose  the  optimal 
number  of  clusters  is  a  nontrivial  model  selection  problem 
and  beyond  the  scope  of  this  paper. 

Figure 2 shows the NMI comparison of the three algorithms. We observe
that although there is no single winner on all the graphs, overall the
MMRC algorithm performs better than SGP and METIS. In particular, on
the difficult data set tr23, MMRC increases performance by about 30%.
Hence, MMRC under normal distribution provides a new graph partitioning
algorithm which is viable and competitive compared with the two
existing state-of-the-art graph partitioning algorithms. Note that
although the normal distribution is most popular, MMRC under other
distribution assumptions may be more desirable in specific graph
clustering applications, depending on the statistical properties of
the graphs.


Data set  Taxonomy structure
TT-TM1    {rec.sport.baseball, rec.sport.hockey},
          {talk.politics.guns, talk.politics.mideast, talk.politics.misc}
TT-TM2    {comp.graphics, comp.os.ms-windows.misc},
          {rec.autos, rec.motorcycles},
          {sci.crypt, sci.electronics}

Table 3: Taxonomy structures of two data sets for constructing
tri-partite relational data


Figure 3: NMI comparison of BSGP, RSN and MMRC algorithms for bi-type
data. (Horizontal axis: data sets BT-NG1, BT-NG2, BT-NG3.)

6.2  Bi-clustering  and  Tri-clustering 

In this section, we apply the MMRC algorithm under Poisson
distribution to clustering bi-type relational data (word-document
data) and tri-type relational data (word-document-category data). Two
algorithms, Bi-partite Spectral Graph Partitioning (BSGP) [11] and
Relation Summary Network under Generalized I-divergence (RSN-GI) [34],
are used for comparison in bi-clustering. For tri-clustering,
Consistent Bipartite Graph Co-partitioning (CBGC) [18] and RSN-GI are
used for comparison.

The bi-type relational data, word-document data, are constructed
from various subsets of the 20-Newsgroups data. We pre-process the
data by selecting the top 2000 words by mutual information. The
document-word matrix is based on the tf.idf weighting scheme, and
each document vector is normalized to unit L2 norm. Specific details
of the data sets are listed in Table 2. For example, for the data
set BT-NG3 we randomly and evenly sample 200 documents from each of
the corresponding newsgroups; we then formulate a bi-type relational
data set of 1600 documents and 2000 words.
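The weighting and normalization steps described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the tf.idf variant (raw term count times log(N/df)) is an assumption, the function name tfidf_unit_vectors and the toy documents are hypothetical, and the mutual-information word selection is omitted.

```python
import math
from collections import Counter

def tfidf_unit_vectors(docs):
    """docs: list of token lists.
    Returns (vocab, rows); each row is an L2-normalized tf.idf vector
    over the sorted vocabulary."""
    vocab = sorted({w for doc in docs for w in doc})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each word.
    df = Counter(w for doc in docs for w in set(doc))
    rows = []
    for doc in docs:
        tf = Counter(doc)
        # Assumed variant: raw term count times log(N / df).
        vec = [tf[w] * math.log(n_docs / df[w]) for w in vocab]
        norm = math.sqrt(sum(x * x for x in vec))
        # Leave an all-zero vector unchanged to avoid dividing by zero.
        rows.append([x / norm for x in vec] if norm > 0 else vec)
    return vocab, rows

# Toy documents (hypothetical), one token list per document.
docs = [["ball", "bat", "ball"], ["puck", "ice"], ["ball", "ice"]]
vocab, rows = tfidf_unit_vectors(docs)
```

After this step every non-empty document row has unit Euclidean length, so documents of different lengths contribute comparably to the document-word relation matrix.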

Figure 4: NMI comparison of CBGC, RSN and MMRC algorithms for
tri-type data (data sets TT-TM1, TT-TM2).


Cluster 23 of actors:
Viggo Mortensen, Sean Bean, Miranda Otto, Ian Holm,
Christopher Lee, Cate Blanchett, Ian McKellen, Liv Tyler,
David Wenham, Brad Dourif, John Rhys-Davies, Elijah Wood,
Bernard Hill, Sean Astin, Andy Serkis, Dominic Monaghan,
Karl Urban, Orlando Bloom, Billy Boyd, John Noble, Sala Baker

Cluster 118 of movies:
The Lord of the Rings: The Fellowship of the Ring (2001)
The Lord of the Rings: The Two Towers (2002)
The Lord of the Rings: The Return of the King (2003)

Table 4: Two clusters from actor-movie data


The tri-type relational data are built from the 20-Newsgroups data
for hierarchical taxonomy mining. In the field of text
categorization, hierarchical taxonomy classification is widely used
to obtain a better trade-off between effectiveness and efficiency
than flat taxonomy classification. To take advantage of hierarchical
classification, one must mine a hierarchical taxonomy from the data
set. Words, documents, and categories form a sandwich-structured
tri-type relational data set in which documents are the central-type
nodes. The links between documents and categories are constructed
such that if a document belongs to k categories, the weight of each
link between this document and these k category nodes is 1/k (please
refer to [18] for details). The true taxonomy structures for the two
data sets, TT-TM1 and TT-TM2, are documented in Table 3.
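The 1/k link weighting between documents and categories can be sketched as below. This is only an illustration of the rule stated above; the function name doc_category_links and the toy labels are hypothetical, and the code assumes each document's category labels are treated as a set.

```python
def doc_category_links(doc_labels, categories):
    """Build a document-by-category link matrix.
    A document belonging to k categories gets weight 1/k on each of them."""
    col = {c: j for j, c in enumerate(categories)}
    links = []
    for labels in doc_labels:
        row = [0.0] * len(categories)
        distinct = set(labels)  # assumption: repeated labels count once
        k = len(distinct)
        for c in distinct:
            row[col[c]] = 1.0 / k
        links.append(row)
    return links

# Toy example (hypothetical documents and labels).
categories = ["rec.autos", "rec.motorcycles", "sci.crypt"]
doc_labels = [["rec.autos"], ["rec.autos", "sci.crypt"]]
links = doc_category_links(doc_labels, categories)
```

Each non-empty row sums to 1, so a document spread over several categories carries the same total link weight as a document in a single category.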

Figure 3 and Figure 4 show the NMI comparison of the three
algorithms on bi-type and tri-type relational data, respectively. We
observe that the MMRC algorithm performs significantly better than
BSGP and CBGC, and slightly better than RSN on some data sets. Since
RSN is a special case of hard MMRC, this shows that mixed MMRC
improves hard MMRC's performance on these data sets. Therefore,
compared with the existing state-of-the-art algorithms, the MMRC
algorithm performs more effectively on these bi-clustering and
tri-clustering tasks; it is also flexible enough to handle other
multi-clustering tasks, which may be more complicated than tri-type
clustering.
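For readers unfamiliar with the measure, NMI (normalized mutual information) between a ground-truth labeling and a clustering can be computed as sketched below. The normalization by the geometric mean of the two entropies is an assumption, since this section does not state which variant is used; returning 0 when either labeling has a single cluster is a convention chosen here.

```python
import math
from collections import Counter

def nmi(labels_true, labels_pred):
    """Normalized mutual information: I(T; P) / sqrt(H(T) * H(P))."""
    n = len(labels_true)
    count_t = Counter(labels_true)
    count_p = Counter(labels_pred)
    joint = Counter(zip(labels_true, labels_pred))
    # Mutual information from the joint and marginal label counts.
    mi = sum((c / n) * math.log(c * n / (count_t[t] * count_p[p]))
             for (t, p), c in joint.items())
    # Entropies of the two labelings.
    h_t = -sum((c / n) * math.log(c / n) for c in count_t.values())
    h_p = -sum((c / n) * math.log(c / n) for c in count_p.values())
    if h_t == 0.0 or h_p == 0.0:
        return 0.0  # convention: a single-cluster labeling carries no information
    return mi / math.sqrt(h_t * h_p)

same_partition = nmi([0, 0, 1, 1], [1, 1, 0, 0])   # same grouping, renamed labels
independent = nmi([0, 0, 1, 1], [0, 1, 0, 1])      # grouping unrelated to the truth
```

The score does not depend on how cluster labels are named, which is why it suits comparing clusterings against newsgroup labels.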

6.3  A  Case  Study  on  Actor-movie  Data 

We also run the MMRC algorithm on actor-movie relational data based
on the IMDB movie data set as a case study. In the data, actors are
related to each other by collaboration (homogeneous relations);
actors are related to movies by taking roles in movies
(heterogeneous relations); movies have attributes such as release
time and rating (note that there are no links between movies). Hence
the data have all three types of information. We formulate a data
set of 20000 actors and 4000 movies and run experiments with
k = 200. Although there is no ground truth for the data's cluster
structure, we observe that most resulting clusters consist of actors
or movies of a similar style, such as action, or of tight groups
from specific movie series. For example, Table 4 shows cluster 23 of
actors and cluster 118 of movies; the parameter T_{23,118} shows
that these two clusters are strongly related to each other. In fact,
the actor cluster contains the actors in the movie series "The Lord
of the Rings". Note that if we have only one type of object, actors,
we obtain only the actor clusters; with two types of nodes, even
though there are no links between the movies, we also obtain the
related movie clusters, which explain how the actors are related.
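Reading off which movie cluster is most strongly related to a given actor cluster amounts to a row-wise maximum over the cluster relation parameters. The sketch below assumes those parameters are available as a plain matrix indexed by (actor cluster, movie cluster); the function name strongest_partner and the small matrix are made up for illustration and are not values from the experiment.

```python
def strongest_partner(relation, row):
    """Return (column index, value) of the largest entry in the given row."""
    values = relation[row]
    best = max(range(len(values)), key=values.__getitem__)
    return best, values[best]

# Toy relation matrix (illustrative values only):
# rows are actor clusters, columns are movie clusters.
relation = [
    [0.1, 0.7, 0.2],
    [0.6, 0.3, 0.1],
]
partner, strength = strongest_partner(relation, 0)
```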

7.  CONCLUSIONS 

In this paper, we propose a probabilistic model for relational
clustering, which provides a principled framework to unify various
important clustering tasks including traditional attributes-based
clustering, semi-supervised clustering, co-clustering and graph
clustering. Under this model, we propose parametric hard and soft
relational clustering algorithms under a large number of exponential
family distributions. The algorithms are applicable to relational
data of various structures and at the same time unify a number of
state-of-the-art clustering algorithms. The theoretical analysis and
experimental evaluation show the effectiveness and great potential
of the proposed model and algorithms.


Dataset   Newsgroups Included                         Documents   Total #
Name                                                  per Group   Documents

BT-NG1    rec.sport.baseball, rec.sport.hockey        200         400

BT-NG2    comp.os.ms-windows.misc, comp.windows.x,    200         1000
          rec.motorcycles, sci.crypt, sci.space

BT-NG3    comp.os.ms-windows.misc, comp.windows.x,    200         1600
          misc.forsale, rec.motorcycles, sci.crypt,
          sci.space, talk.politics.mideast,
          talk.religion.misc

Table 2: Subsets of Newsgroup Data for bi-type relational data